First Edition 2026
Copyright © BPB Publications, India
ISBN: 978-93-65898-385
All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system, but may not be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The information contained in this book is true and correct to the best of the author's and publisher's knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.
All trademarks referred to in the book are acknowledged as properties of their respective owners but BPB Publications cannot guarantee the accuracy of this information.
Dedicated to
My parents, my wife, my kids, and the mother
Indrajit Kar is a distinguished AI thought leader, innovator, and author of five AI/ML books, with over 22 years of experience driving transformative AI-led products and platforms across industries. Throughout his career, he has led numerous high-impact teams responsible for developing end-to-end solutions in AI, ML, GenAI, and data science, guiding projects from conceptualization and design to deployment and scaling.
In his current role as head of AI, Indrajit spearheads large-scale initiatives that deliver measurable business impact across a diverse portfolio of global clients. His work is rooted in deep technical expertise across GenAI, large language model (LLM) architectures, MLOps, natural language processing, and computer vision. He has played a key role in integrating LLMs and autonomous AI agents into real-world applications spanning sectors such as e-commerce, healthcare, life sciences, telecommunications, and manufacturing.
Indrajit is also a strategic advisor and collaborator to C-level executives, helping enterprises unlock business value through advanced AI product and platform transformations. His leadership consistently bridges the gap between cutting-edge research and enterprise-scale implementation, accelerating AI adoption across organizations.
A recognized voice in the AI community, Indrajit has authored two books, including one dedicated to GenAI and its industry applications. He has also contributed extensively to AI research, with 27+ published papers, 21 patents filed, and multiple accolades, including eight Best Paper Awards from reputed conferences and institutions. His work often explores the intersections of innovation, scalability, and responsible AI.
With a legacy of leading R&D programs and having managed AI services and productization efforts for Fortune 500 companies, Indrajit continues to shape the future of intelligent systems. His passion for innovation, combined with a vision for ethical and scalable AI, drives his mission to empower businesses and communities through transformative technology.
Dhanveer Singh is a technology leader at Capital One USA with over 19 years of experience in software engineering, cloud architecture, and large-scale system modernization across financial services, insurance, and retail. He specializes in AWS, microservices, containerization, DevOps, big data, and AI/ML, delivering secure, high-performing platforms that process billions of transactions and serve millions worldwide.
An advocate of cloud-native architectures and automation, Dhanveer has led transformative initiatives in cloud cost optimization, resilience engineering, and cybersecurity automation, driving measurable efficiency and advancing enterprise digital transformation. He has also filed multiple patents in areas of data integration, transformation, data security, and cloud automation, underscoring his focus on innovation.
Beyond his technical leadership, he contributes as a reviewer and TPC member for international journals and conferences, serves as a judge for global IT and cybersecurity awards, and mentors through STEM and CodeDay programs.
Dhanveer is a Fellow of IETE and IAENG, and an active IEEE and ACM member.
Harvendra Singh is a distinguished technology leader specializing in cloud engineering, architecture, automation, and AI-powered solutions. He designs and implements scalable, secure systems utilizing Azure, .NET, C#, Python, GCP, Kubernetes, Databricks, and other cutting-edge technologies. With expertise in cloud-native applications, microservices, event-driven architectures, and distributed systems, Harvendra drives innovation in cloud and AI ecosystems, delivering high-impact solutions that create business value and sustainable growth.
Manish Jain is the vice president and head of AI architecture at Firstsource Solutions, where he leads enterprise-wide AI transformation for Fortune 100 organizations. With more than 20 years of technology leadership, including over a decade driving advanced AI innovation, he has earned recognition as an architect of transformative solutions that deliver quantifiable business impact. In addition to his corporate responsibilities, he acts as a technical consultant for Deeplearning.ai and mentors at Analytics Vidhya. He also serves as a manuscript reviewer for prominent AI publishers such as Manning and Packt, positioning him at the crossroads of research and practical enterprise applications. Manish's unique blend of deep technical expertise and proven executive leadership enables him to guide organizations through the strategic and operational aspects of AI transformation. His commitment to advancing the AI community is evident in his advisory and mentoring roles, as well as his involvement in peer-reviewed publishing.
These experiences make him a compelling authority on the imperatives of AI transformation and the practical challenges of scaling AI across complex enterprise environments, consistently linking innovation with measurable outcomes.
I extend my deepest appreciation to my family, parents, wife, in-laws, and children, whose steadfast encouragement and belief in me have been the cornerstone of this journey. Heartfelt thanks to BPB Publications for their patience and trust, allowing the book’s multi-part publication to thoroughly cover the dynamic field of AI. I am also grateful to my companies for fostering growth and providing opportunities to develop GenAI and agentic applications, which informed the insights shared here. To everyone who supported me, seen and unseen, your guidance and encouragement have profoundly shaped this journey, for which I am eternally thankful.
We are living in the age of intelligent collaboration, where AI is no longer just a tool, but a partner capable of retrieving knowledge, generating ideas, reasoning through problems, and interacting across modalities like text, images, and voice. The emergence of multimodal and agentic applications marks a turning point in how we build, deploy, and rely on AI.
This book, Building Multimodal Generative AI and Agentic Applications, is a practical guide for those who want to move beyond theory and actually build the future of AI systems. Across 18 chapters, you will move step-by-step from fundamentals to advanced implementations, starting with retrieval, generation, and orchestration; progressing into multimodal workflows that combine text, images, and voice; and then advancing toward real-world applications like text-to-SQL systems, OCR, fraud detection, and AI operations.
Every chapter is designed to be hands-on and approachable. You will find conceptual explanations, system design principles, code walkthroughs, and to do exercises that push you to experiment and learn by doing.
The goal of this book is not only to explain how these systems work, but also to empower you to build your own scalable, multimodal, and agentic AI applications, applications that are reliable, safe, and impactful.
Whether you are an engineer, researcher, or leader in technology, I hope that this book equips you with the knowledge, confidence, and inspiration to shape the next generation of AI.
Chapter 1: Introducing New Age Generative AI - This chapter introduces the key building blocks of modern AI systems. It begins with an overview of generative AI and then explores retrieval systems, generation systems, and the strengths of each. It covers how retrieval-augmented generation (RAG) combines the two, and how orchestration helps different AI components work together. The chapter also explains tokens, vector databases, and reranking methods, along with the differences between bi-encoders and cross-encoders. Finally, it discusses essential topics like guardrails for safe AI use, the role of agents, and the importance of Model Context Protocols.
Chapter 2: Deep Dive into Multimodal Systems - This chapter focuses on vision-language models and their role in multimodal AI. It explains what vision-language models are, compares different implementation approaches, and explores how they differ from broader multimodal GenAI systems. The chapter also looks at vision-language models in more depth and introduces ways to classify multimodal systems based on their outputs.
Chapter 3: Implementing Unimodal Local GenAI System - This chapter explores the practical side of building GenAI systems. It begins with the role of GPUs in today’s AI landscape and how to make use of a local GPU. The chapter then introduces Ollama, including how to generate a PDF document with it. Moving forward, it explains how RAG works, along with the key challenges involved in implementing RAG effectively.
Chapter 4: Implementing Unimodal API-based GenAI Systems - This chapter provides a hands-on introduction to working with OpenAI’s APIs and models. It explains how to move from using OpenAI for basic tasks to building more advanced agentic AI solutions. You will learn how to perform multi-document queries, implement a modular retrieval-augmented generation system using OpenAI and Faiss, and explore a set of to do steps for extending these capabilities further.
Chapter 5: Implementing Agentic GenAI Systems with Human-in-the-loop - This chapter focuses on designing and advancing agentic generative AI systems. It starts with principles of architecting such systems and then walks through an end-to-end human-in-the-loop (HITL) RAG workflow. From there, it explores how HITL setups can evolve into multi-agent HITL RAG systems. The chapter concludes by clarifying the differences between agentic AI and AI agents, highlighting their distinct roles and applications.
Chapter 6: Two and Multi-stage GenAI Systems - This chapter provides a deep understanding of the concepts of interactions within dense retrieval systems and their importance in RAG. It explains the role of interaction models in two-stage RAG systems and compares different reranking strategies, including late interaction, full interaction, and multi-vector models. The chapter then introduces two-stage and multi-stage RAG architectures, discusses grading mechanisms for evaluating retrieved results, and demonstrates how to implement a multi-stage RAG workflow with routing for more accurate and efficient responses.
Chapter 7: Building a Bidirectional Multimodal Retrieval System - This chapter introduces multimodal systems and how they can be classified based on their outputs. It then explains the working of a multimodal retrieval system and provides a code implementation with step-by-step explanation. The chapter closes with a to do section, giving readers practical exercises to apply and deepen their understanding.
Chapter 8: Building a Multimodal RAG System - This chapter focuses on practical approaches to generation and evaluation using LLMs. It begins with the implementation of generation techniques, followed by an introduction to the concept of LLM-as-a-judge and its application in building recommender systems. The chapter also covers how to incorporate grading mechanisms with OpenAI to improve evaluation. It concludes with a to do section, giving readers exercises to apply these ideas in practice.
Chapter 9: Building GenAI Systems with Reranking - This chapter explores the concept of reranking and its critical role in improving retrieval and RAG systems. It explains how reranking is applied in both text-based and multimodal contexts, with a focus on using cross-encoders in multimodal RAG. The chapter also introduces the cross-encoder architecture in multimodal settings and the idea of multi-index embedding within RAG systems. Alongside these concepts, it provides a code implementation with detailed explanation and concludes with a to do section to help readers practice and solidify their understanding.
Chapter 10: Retrieval Optimization for Multimodal GenAI - This chapter examines how to make retrieval systems more efficient and effective. It begins by outlining common drawbacks of retrieval systems, then introduces various optimization techniques to address these limitations. The chapter also explores retrieval optimization in detail, showing how these methods can be applied to improve performance. It then shifts focus to multimodal RAG systems, explaining how adaptive index refresh can enhance their accuracy and responsiveness. Finally, it provides a to do section with exercises for readers to apply these ideas in practice.
Chapter 11: Building Multimodal GenAI Systems with Voice as Input - This chapter explores how RAG extends beyond just image and text. It introduces the core concepts of expanding RAG to other modalities and shows how speech interfaces can be integrated into the RAG architecture. The chapter also provides a step-by-step code implementation of a voice-enabled RAG system, demonstrating how to bring these ideas into practice.
Chapter 12: Advanced Multimodal GenAI Systems - This chapter highlights the importance of reasoning in GenAI systems. It explains the different types of reasoning used in GenAI and why they matter for building more reliable and intelligent models. The chapter also introduces key benchmarks that are used to evaluate reasoning capabilities in AI systems.
Chapter 13: Advanced Multimodal GenAI Systems Implementation - This chapter focuses on how reasoning can be enhanced in GenAI through effective prompting techniques. It then explores specialized architectures that bring reasoning into play at different stages—first during reranking, where results are refined, and then at the recommendation stage, where reasoning helps deliver more accurate and context-aware suggestions.
Chapter 14: Building Text-to-SQL Systems - This chapter delves into the complexities of text-to-SQL and why it is considered a challenging problem. It begins by explaining the basic concepts and then explores real-world applications where text-to-SQL can make a significant impact. The chapter discusses the key challenges involved, followed by practical guidance on designing an effective text-to-SQL system. It also covers entity extraction using large language models, highlighting how this integrates with text-to-SQL to improve performance. Finally, the chapter emphasizes how such systems can enhance data accessibility and literacy, while also introducing performance metrics and best practices to ensure reliability.
Chapter 15: Agentic Text-to-SQL Systems and Architecture Decision-Making - This chapter presents the design and implementation of an agentic text-to-SQL system tailored for real-time retail intelligence. It explains the system’s architecture in detail, along with code walkthroughs for better understanding. A step-by-step pipeline is provided to show how the system processes queries, leading to meaningful outputs. The chapter concludes by demonstrating the actual results generated by the text-to-SQL system and how they address the original problem statement.
Chapter 16: GenAI for Extracting Text from Images - This chapter introduces three different approaches to applying GenAI for optical character recognition. It explains how OCR works on images, as well as how it can be extended to multimodal documents that combine text, images, and other elements. The chapter concludes with a to do section, giving readers practical exercises to apply and reinforce what they have learned.
Chapter 17: Integrating Traditional AI/ML into GenAI Workflow - This chapter explores how traditional machine learning models can be integrated into GenAI workflows through a detailed case study. It presents a practical use case of hybrid ensemble learning for telecom fraud detection, showing how models like XGBoost can be wrapped and enhanced within an LLM-powered system. The chapter also provides a comparative overview of different ways ML models can be combined with GenAI to create hybrid solutions. It concludes with a to do section, offering readers hands-on activities to deepen their understanding.
Chapter 18: LLM Operations and GenAI Evaluation Techniques - This chapter highlights the importance of operations in building and running production-grade GenAI applications. It compares evaluation methods for LLMs and RAG systems, introduces the concept of RagOps, and emphasizes the need for continuous monitoring and observability platforms. The chapter also explores how graph-enhanced RAG can improve recommendation systems and provides a comparison of different Ops practices in modern software development. Finally, it offers practical guidance on setting up MLflow for managing experiments and deployments.
Please follow the link to download the Code Bundle and the Coloured Images of the book:
The code bundle for the book is also hosted on GitHub at https://github.com/bpbpublications/Building-Multimodal-Generative-AI-and-Agentic-Applications. In case there’s an update to the code, it will be updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos available at https://github.com/bpbpublications. Check them out!
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide our subscribers with an engaging reading experience. Our readers are our mirrors, and we use their inputs to reflect upon and improve any human errors that may have occurred during the publishing process. To help us maintain quality and reach out to any readers who might be having difficulties due to unforeseen errors, please write to us at:
Your support, suggestions, and feedback are highly appreciated by the BPB Publications' Family.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
1. Introducing New Age Generative AI
2. Deep Dive into Multimodal Systems
3. Implementing Unimodal Local GenAI System
4. Implementing Unimodal API-based GenAI Systems
5. Implementing Agentic GenAI Systems with Human-in-the-loop
6. Two and Multi-stage GenAI Systems
7. Building a Bidirectional Multimodal Retrieval System
8. Building a Multimodal RAG System
9. Building GenAI Systems with Reranking
10. Retrieval Optimization for Multimodal GenAI
11. Building Multimodal GenAI Systems with Voice as Input
12. Advanced Multimodal GenAI Systems
13. Advanced Multimodal GenAI Systems Implementation
14. Building Text-to-SQL Systems
15. Agentic Text-to-SQL Systems and Architecture Decision-Making
16. GenAI for Extracting Text from Images
17. Integrating Traditional AI/ML into GenAI Workflow
18. LLM Operations and GenAI Evaluation Techniques
This chapter sets the stage for mastering new age generative AI (GenAI) systems by introducing essential concepts and foundational technologies. We begin by exploring the difference between retrieval systems and generation systems, followed by an in-depth look at vector databases, search algorithms, embedding techniques, indexing, and reranking, all critical for building intelligent, efficient AI solutions. Key reliability mechanisms, such as reflection and guardrails, are discussed to ensure outputs remain robust and aligned with user intent.
We then dive into advanced prompting methods like chain of thought (CoT) to guide AI models through structured reasoning processes. Moving into agentic AI, the chapter covers agents, tools, reasoning, planning, and action execution, expanding into the design of multi-agent systems capable of complex, collaborative tasks. A comparative overview of large language models (LLMs), large vision models (LVMs), and emerging large action models (LAMs) is provided, along with practical insights into local model deployment and graphics processing unit (GPU) infrastructure planning.
Further, we introduce speech technologies, including automated speech recognition (ASR) and generation, and explain the critical role of memory management in agent-based architectures. Finally, we present industry standards like Model Context Protocol (MCP) and differentiate the evolving responsibilities of a GenAI developer vs. a GenAI engineer, preparing readers for advanced system design.
This chapter covers the following topics:
This chapter aims to equip readers with a comprehensive understanding of the key building blocks essential for designing and deploying modern GenAI systems. By exploring concepts such as retrieval and generation systems, vector databases, embedding techniques, advanced prompting strategies, agentic architectures, and multi-agent collaboration, readers will gain a strong foundation for building intelligent, scalable AI solutions. Additionally, the chapter introduces critical topics like local model deployment, GPU infrastructure, speech processing, memory management in agents, and industry standards like MCPs. These foundational elements are crucial for advancing toward multimodal, reliable, and production-ready AI applications.
The evolution of generative models represents one of the most significant paradigm shifts in AI. In the era before generative pre-trained transformers (GPTs), GenAI was shaped by powerful techniques such as Boltzmann machines, variational autoencoders (VAEs), generative adversarial networks (GANs), and autoencoders. These models achieved groundbreaking results by generating unstructured data like images, audio, and even text. For instance, GANs revolutionized realistic image synthesis, while VAEs enabled probabilistic generative modeling of complex data spaces, including speech and document generation.
While impressive, these earlier systems generally focused on single-domain generation with limited ability to reason, plan, or generalize across tasks. They lacked the rich contextual understanding, dynamic reasoning, and task-driven flexibility that define modern AI experiences.
The true paradigm shift occurred not directly with GPT models, but with the introduction of the transformer architecture itself in 2017 (in the seminal paper Attention Is All You Need by Vaswani et al.). The transformer introduced the concepts of self-attention, parallel processing, and positional encoding, enabling models to scale massively in both size and capability, far beyond the limits of generative models based on traditional recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or convolutional neural networks (CNNs).
Building on the transformer foundation, GPTs ushered in the era of open-ended generation models capable of not just recreating data but performing tasks like conversation, reasoning, summarization, code generation, and multimodal synthesis. The modern GenAI systems now exhibit semantic awareness, dynamic problem-solving, and multimodal understanding across text, images, and speech.
Several key advancements define this new age, which are as follows:
Note: The scope of this book is focused exclusively on new-age GenAI systems. If you seek to explore the foundations of older generative models, including Boltzmann machines, autoencoders, VAEs, and GANs, you can refer to another book authored by me and my co-author, titled "Learn Python Generative AI: Journey from Autoencoders to Transformers to Large Language Models" (published by BPB Publications). It provides a detailed walkthrough of the classical generative modeling journey leading to today's cutting-edge systems.
In this book, we move beyond classical generation, focusing on designing, building, and deploying reasoning, planning, and action-oriented GenAI—the systems that are now transforming industries, enterprises, and everyday experiences. Understanding this transition is key: what started as data mimicry has evolved into intelligent, multimodal agents capable of augmenting and automating human thought itself.
While generative models have evolved to create rich, human-like outputs, not all AI solutions rely solely on generation. In fact, many of the most powerful AI systems today combine retrieval with generation to ground their outputs in real-world information, improve reliability, and reduce hallucinations.
Before exploring generation strategies, it is essential to first understand retrieval systems, the backbone of how AI finds, filters, and brings relevant knowledge into the conversation. Retrieval forms a critical pillar of modern AI infrastructure, supporting tasks ranging from search engines and recommendation systems to advanced retrieval-augmented generation (RAG) pipelines.
In the next section, we will explore what retrieval systems are, how they differ from pure generative models, and why they are indispensable for building accurate, scalable, and production-grade AI applications.
GenAI systems today are celebrated for their creativity and reasoning abilities, but behind many of these intelligent behaviors lies a strong foundation built on retrieval mechanisms. Retrieval is often the hidden engine that allows AI to ground its outputs in real-world knowledge, find relevant facts, and maintain coherence across conversations or tasks. To truly appreciate how retrieval has become such a critical pillar of modern AI, it is important to first understand how it evolved, from simple keyword matching to sophisticated, learning-driven, and memory-augmented techniques.
Prior to understanding modern retrieval systems, it is helpful to trace their evolution briefly, which is discussed in the following table:
| Year | Milestone | Description |
|---|---|---|
| 1970s-2000s | Term frequency–inverse document frequency (TF-IDF), Best Matching 25 (BM25) | Early keyword-based retrieval methods focused on matching exact terms. |
| 2020 | Dense passage retrieval (DPR) | Introduced dense embeddings to semantically match questions and documents. |
| 2021 | Hybrid retrieval | Combined sparse (BM25) and dense (DPR) methods to improve robustness. |
| 2020–2022 | RAG | Tight integration of retrieval with generation models to enhance grounding. |
| 2023+ | In-context learning retrieval, memory-augmented retrieval | Dynamic, reasoning-driven retrieval embedded inside LLM workflows. |
Table 1.1: Historic timelines of retrieval systems
With the preceding background, given in Table 1.1, in mind, it becomes clear that retrieval is no longer a simple lookup process; it has evolved into a dynamic, intelligent layer that actively augments the reasoning capabilities of AI systems. In the following sections, we will explore how retrieval systems work, the key components that make them powerful, and how they integrate seamlessly with generative models to build reliable, context-aware AI applications.
The foundation of modern retrieval systems can be traced back to early innovations like DPR, introduced by Facebook AI Research (now Meta AI) around 2020. DPR was a major breakthrough compared to traditional sparse retrieval methods (such as TF-IDF and BM25) because it introduced dense vector representations for both queries and documents. This allowed semantic retrieval, finding information based on meaning rather than relying purely on keyword overlap.
Dense retrieval marked a major turning point: models could now encode the meaning of a query and a document into a shared embedding space where similarity could be computed efficiently. Instead of matching exact words, dense retrieval matched concepts and ideas. However, early dense retrievers still had limitations: they sometimes retrieved irrelevant passages due to coarse semantic matching, and scaling them to millions or billions of documents required solving difficult engineering challenges around efficiency and latency.
Sparse retrieval methods like TF-IDF and BM25 rely on matching exact keywords and term frequency statistics. While older, they remain highly effective in cases where precision is critical and queries are closely tied to specific terminology, such as in legal document search, scientific literature, and enterprise document retrieval, where exact matches matter more than general semantic similarity. Sparse retrieval also scales very efficiently with traditional inverted index techniques and remains a strong baseline in many real-world search systems.
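To make the sparse approach concrete, here is a minimal sketch of TF-IDF retrieval using scikit-learn. The corpus and query are illustrative assumptions; a production BM25 setup would typically use a dedicated library such as rank_bm25 or an inverted-index search engine, but the keyword-matching principle is the same.

```python
# A minimal sketch of sparse (TF-IDF) keyword retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The court ruled on the patent infringement case.",
    "Patent law protects inventions for a fixed term.",
    "The recipe calls for two cups of flour.",
]
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)            # sparse term-weight matrix
query_vec = vectorizer.transform(["patent infringement ruling"])

scores = cosine_similarity(query_vec, doc_matrix)[0]     # one score per document
for idx in scores.argsort()[::-1]:                       # rank best-first
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```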
Dense retrieval methods, introduced with models like DPR and Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE), marked a major shift from sparse term-matching techniques (e.g., BM25) toward semantic vector-based retrieval. Dense retrievers excel when dealing with open-domain search, ambiguous queries, or when synonyms and paraphrases are common, for example, in customer support bots, multilingual retrieval, or semantic frequently asked questions (FAQs) matching. Dense retrieval allows systems to understand the intent behind a question, even when the exact words differ between the query and the document. The following figure shows the basic flow of semantic retrieval using a vector database:
Figure 1.1: Basic flow of semantic retrieval using a vector database
Note: To maintain clarity and simplicity, this figure illustrates document chunking and embedding as part of the overall RAG process. In practice, these steps (chunking and embedding of documents) are performed offline during the indexing phase and not during real-time query execution. This simplification applies across all figures and workflows presented in the chapters of this book.
The following figure illustrates the offline phase of a RAG pipeline, where raw documents are first processed using language chunking tools (e.g., Llama-based parsers or LangChain utilities) to divide them into manageable segments. These chunks are then passed through an embedding model, such as OpenAI’s embedding API, to generate dense vector representations. The resulting embeddings are stored in a vector database, forming the searchable index that powers downstream retrieval during real-time query execution. This preprocessing step is critical to enabling fast, scalable, and semantically rich document retrieval in multimodal or LLM-based applications.
Figure 1.2: Offline document indexing and embedding workflow
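As a rough illustration of this offline phase, the sketch below chunks documents, embeds the chunks, and stores them in a Faiss index. The naive fixed-size chunker and the all-MiniLM-L6-v2 model are illustrative assumptions, not the exact tools named in the figure.

```python
# A minimal sketch of the offline indexing phase: chunk -> embed -> index.
import faiss
from sentence_transformers import SentenceTransformer

def chunk(text, size=200):
    # Naive fixed-size character chunking; production systems usually
    # split on sentences or tokens, often with overlap between chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

documents = ["...long document one...", "...long document two..."]
chunks = [c for doc in documents for c in chunk(doc)]

model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])           # inner product == cosine
index.add(embeddings)                                    # the searchable vector store
```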
Reflecting on the evolution, today’s retrieval systems have dramatically advanced beyond the early DPR architecture:
Additionally, vector database technology has matured rapidly. Tools like Facebook AI Similarity Search (Faiss), Milvus, Qdrant, Azure AI Search, and Pinecone offer scalable, high-speed vector search, supporting billions of embeddings with approximate nearest neighbor (ANN) algorithms, metadata filtering, and hybrid retrieval capabilities—all critical for powering modern enterprise-grade RAG systems.
It is crucial to recognize that retrieval today is no longer just about fetching documents. It has become an intelligent augmentation mechanism, involving filtering, reranking, reasoning, and dynamic knowledge grounding. Retrieval is evolving from a backend lookup service into a frontline reasoning component of next-generation AI.
Thus, understanding retrieval deeply, not simply as a search technique but as an intelligent augmentation strategy, is essential for building reliable, scalable, and goal-driven new-age GenAI applications.
Retrieval systems are typically evaluated based on metrics like recall@k, precision@k, and Mean Reciprocal Rank (MRR), which measure how effectively the system retrieves relevant documents among the top results. We will cover retrieval evaluation in greater detail later, but for now, it is important to remember that retrieval quality is judged by both accuracy and ranking efficiency.
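The following self-contained snippet illustrates how these metrics are computed for a single query, using hypothetical document IDs; MRR is simply the reciprocal rank averaged over many queries.

```python
# Retrieval metrics for one query with made-up document IDs.
def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k

def reciprocal_rank(ranked, relevant):
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

ranked = ["d7", "d2", "d9", "d4"]    # system output, best first
relevant = {"d2", "d4"}              # ground-truth relevant documents

print(recall_at_k(ranked, relevant, 3))     # 0.5   (only d2 appears in top 3)
print(precision_at_k(ranked, relevant, 3))  # 0.333 (1 relevant of 3 returned)
print(reciprocal_rank(ranked, relevant))    # 0.5   (first hit at rank 2)
```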
As we have seen, retrieval systems focus on finding the most relevant existing information. However, many real-world tasks demand more than just retrieval—they require creation, reasoning, and original synthesis. This is where generation systems come into play.
In this section, we will explore what generation systems are, how they operate, and the core techniques that power them. We will discuss different types of generation tasks, such as text, image, and audio creation, and understand key mechanisms like autoregressive modeling, diffusion models, and sampling strategies. Additionally, we will cover important concepts like temperature control, prompt design, and the balance between creativity and factuality.
We will also examine the typical challenges faced by generation systems, such as hallucination, coherence issues, and safety risks, and highlight where these systems truly excel, especially in tasks that demand open-ended creativity or complex problem-solving. Finally, we will briefly introduce how retrieval and generation are increasingly being combined in modern AI architectures to build more grounded and intelligent systems.
Let us begin by understanding the fundamental nature of generation systems and how they differ from purely retrieval-based approaches.
Generation systems are AI models designed to produce new content, rather than simply retrieve it. They can generate text, images, audio, code, and even multimodal outputs by learning complex patterns from training data. Unlike retrieval, which surfaces information that already exists, generation enables models to compose new sentences, invent new images, and solve new problems dynamically at inference time.
Modern generation systems are typically large-scale neural networks or LLMs trained with billions of parameters on massive datasets across multiple domains. The following figure shows the types of LLMs and generation models:
GenAI systems span multiple modalities, each designed to create content such as text, images, or audio based on user input, showcasing the versatility and power of modern machine learning (ML) models. Let us look at the types of generation systems:
Core techniques behind the generation are as follows:
In autoregressive models (like GPT), each output token is generated one at a time, conditioned on previously generated tokens. This sequential token-by-token generation allows models to produce highly coherent outputs, but can also lead to error accumulation if not managed carefully. The following figure explains how LLM generates in an autoregressive manner (one token at a time):
Figure 1.4: LLM generation in an autoregressive manner (one token at a time)
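The loop below sketches this token-by-token process with GPT-2 via the Hugging Face transformers library, using greedy decoding for simplicity; production systems usually sample from the distribution rather than always taking the most likely token.

```python
# A minimal sketch of greedy autoregressive decoding: each step feeds all
# previously generated tokens back in and appends the most likely next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Today is a beautiful", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                        # generate 10 tokens, one at a time
        logits = model(ids).logits             # scores for every vocabulary token
        next_id = logits[0, -1].argmax()       # greedy: pick the single best token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```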
The following are the temperature and sampling strategies:
Tuning these parameters allows fine control over creativity vs. precision in AI generation.
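A small pure-NumPy demonstration of the temperature effect, with made-up logits for four candidate tokens, is shown below.

```python
# How temperature reshapes the next-token distribution before sampling.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())        # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 3.5, 1.0, 0.5]                  # hypothetical scores for 4 tokens
for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))

# Low temperature (0.2) concentrates probability on the top token (near-greedy);
# high temperature (2.0) flattens the distribution, increasing randomness.
# Top-k / top-p sampling would additionally truncate this distribution
# before drawing a token.
```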
Prompts are critical for steering the behavior of generation systems. Advanced prompting techniques like CoT enable multi-step reasoning by encouraging models to explain their thought process before answering. We will explain these in more detail in the next section.
Generation systems are particularly powerful in the following:
While generation systems are incredibly powerful at creating new content, they sometimes struggle with factual accuracy, up-to-date knowledge, and grounding their outputs in real-world information. To overcome these challenges, modern AI architectures increasingly combine the strengths of retrieval and generation, giving rise to a powerful paradigm known as RAG.
In the next section, we will explore how RAG systems work, why they are critical for building reliable AI applications, and how they seamlessly integrate retrieval and generation into a unified, intelligent workflow.
RAG is an advanced AI architecture that combines retrieval and generation into a unified workflow. Instead of relying solely on a model's internal knowledge (which may be outdated or incomplete), a RAG system first retrieves relevant external information and then generates an answer conditioned on that retrieved content.
RAG emerged to address key challenges faced by pure generation models, which are as follows:
RAG bridges these gaps, making outputs more accurate, grounded, and up-to-date.
A RAG system typically involves two major steps, which are as follows:
Thus, the model does not generate from memory alone; it reads first, then reasons.
The following list outlines what a basic RAG pipeline looks like:
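As a compact companion to these steps, the sketch below wires retrieval and generation together, reusing the embedding model, Faiss index, and chunk list from the offline indexing sketch shown earlier in this chapter. The OpenAI client call, model name, and prompt wording are illustrative assumptions, not the book's exact setup.

```python
# A minimal single-stage RAG pipeline: embed the query, retrieve the k
# nearest chunks, and generate an answer conditioned on them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rag_answer(question, k=3):
    # 1. Retrieve: embed the query and fetch the k nearest chunks.
    q_vec = model.encode([question], normalize_embeddings=True)
    _, idxs = index.search(q_vec, k)
    context = "\n\n".join(chunks[i] for i in idxs[0])

    # 2. Generate: condition the LLM on the retrieved context.
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    response = client.chat.completions.create(
        model="gpt-4o-mini",                   # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```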
There are many different types of RAG architectures evolving today, depending on how retrieval and generation are orchestrated. However, to keep the scope focused, the following are the two most common and practical ones:
The following figure shows a single-stage RAG architecture:
The following figure shows a two-stage RAG architecture:
The following are the two iterative RAG approaches:
Vector databases are critical infrastructure for efficient RAG systems.
ANN algorithms are used for scalability, finding close enough vectors quickly rather than exact matches, enabling real-time retrieval over millions or billions of documents.
Vector stores also allow metadata filtering (e.g., date, author) and sharding for distributed retrieval, essential for scaling enterprise RAG systems.
How the retrieved content is formatted and fed into the LLM significantly affects output quality.
Key techniques include the following:
Well-constructed prompts ensure the LLM focuses on the most important information during generation.
As RAG systems evolve, advanced techniques are being developed to enhance retrieval quality, improve response accuracy, and enable more context-aware generation. The following are some of the advanced RAG techniques:
RAG systems have rapidly gained adoption across industries. Let us understand its applications:
In every case, RAG ensures the AI system produces reliable, verifiable, and grounded outputs.
As AI systems become increasingly complex, especially with the rise of RAG and agentic AI systems, the need for intelligent orchestration has become critical. Orchestration refers to how different components, such as retrieval engines, language models, memory modules, and external tools, are managed, sequenced, and coordinated dynamically to achieve a specific goal.
Unlike traditional single-call LLM applications, RAG systems and agentic systems involve multi-step reasoning and dynamic decision-making, requiring sophisticated orchestration frameworks.
In RAG systems, orchestration involves the following:
Frameworks like LangChain, LlamaIndex, and Haystack specialize in orchestrating these steps automatically, making it easier to build scalable and production-ready RAG pipelines.
The following figure explains how LangChain is orchestrating the entire RAG process:
Figure 1.7: The bold lines are orchestrated by LangChain or similar orchestrators
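To show what such an orchestrator actually coordinates, here is a deliberately simplified, hand-rolled control flow. The retriever, reranker, and llm callables and their attributes are hypothetical placeholders for the components that frameworks like LangChain or LlamaIndex wire together (along with retries, tracing, and streaming).

```python
# A hand-rolled sketch of RAG orchestration: sequencing retrieval, reranking,
# prompt assembly, and generation, with a simple fallback path.
def orchestrate(question, retriever, reranker, llm, min_score=0.3):
    candidates = retriever(question)            # step 1: fetch candidate chunks
    ranked = reranker(question, candidates)     # step 2: reorder by relevance
    if not ranked or ranked[0].score < min_score:
        return "I could not find enough information to answer that."
    context = "\n".join(doc.text for doc in ranked[:3])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                          # step 3: grounded generation
```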
Good RAG orchestration ensures the following:
In agentic systems, orchestration becomes even more dynamic.
An agent is an AI entity capable of the following:
Agentic orchestration involves the following:
Frameworks like LangChain Agents, LlamaIndex, Haystack, etc., provide orchestration primitives for agentic systems, allowing AI models to behave like autonomous, multi-step decision-makers.
Good agentic orchestration ensures the following:
While orchestration focuses on managing the overall flow of complex AI systems, another foundational concept operates at a much lower level - how information itself is represented and processed inside models. Before any retrieval, generation, or reasoning can happen, input text must be broken down into a form that models can understand—a process known as tokenization.
To fully appreciate the capabilities and limitations of AI systems, it is essential to understand what tokens are, how tokenization works, and why it plays a critical role in shaping performance, cost, and design choices.
Let us now understand tokenization.
In modern AI systems, particularly LLMs, the concept of tokens is fundamental to how inputs and outputs are processed. A token is not necessarily a word; it can be a word, a part of a word (subword), or even punctuation and special characters, depending on the model’s tokenizer.
Tokenization is the process of breaking down text into discrete units that the model can understand and process. Models like GPT-3, GPT-4, and Llama do not operate directly on raw text; they operate on sequences of tokens.
There are different types of tokenization strategies, like the following:
The number of tokens determines:
Thus, understanding tokens, what they represent, and how they are counted is critical for optimizing performance, controlling generation length, managing costs, and designing effective prompt engineering strategies.
For example, the input "Today is a beautiful day outside." might be split into subwords like (To, day, is, a, be, aut, iful, day, out, side), depending on the tokenizer.
Once split into tokens, each token is then mapped to a unique token ID using a vocabulary table (pre-built during model training). Each token ID corresponds to an integer that the model understands internally. For instance:
Thus, the entire input sequence is transformed into a vector of token IDs—a list of numbers that the model can operate on.
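The snippet below makes this concrete using tiktoken, the tokenizer library used with OpenAI models; the exact splits and integer IDs depend on the encoding chosen.

```python
# Tokenizing a sentence into integer token IDs with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Today is a beautiful day outside."
token_ids = enc.encode(text)                   # text -> list of integer token IDs

print(token_ids)                               # the IDs the model actually consumes
print([enc.decode([t]) for t in token_ids])    # the subword each ID maps back to
print(len(token_ids), "tokens")                # what context limits and billing count
```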
At this point, the token IDs are passed through an embedding layer. This embedding layer converts each token ID into a high-dimensional vector (e.g., 768-dimensional) that captures semantic relationships between tokens. Tokens that are semantically related (e.g., dog and puppy) will have embeddings that are close in vector space.
From there, the token embeddings move through the model’s architecture, the attention layers, transformer blocks, and eventually lead to output generation or reasoning.
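A minimal PyTorch sketch of this embedding lookup step follows; the vocabulary size and token IDs are illustrative (real models use vocabularies of tens of thousands of tokens or more).

```python
# Each token ID indexes a row in a learned matrix, yielding a dense vector.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 768
embedding = nn.Embedding(vocab_size, embed_dim)    # learnable (1000 x 768) table

token_ids = torch.tensor([[12, 407, 9]])           # one 3-token sequence
vectors = embedding(token_ids)                     # shape: (1, 3, 768)
print(vectors.shape)
```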
In summary, tokenization bridges the gap between human language and machine understanding. It translates messy, variable-length human text into standardized numerical forms that deep learning models can efficiently process. Without tokenization, modern language models would not be able to handle the complexity and diversity of human communication. The following figure shows the tokenization process flow in modern language models:
While tokenization enables language models to process and understand text inputs at a granular level, handling large-scale retrieval tasks requires a different kind of representation. Instead of working directly with tokens, retrieval systems operate on dense vector embeddings—mathematical representations that capture the semantic meaning of text, images, or other data types. To store, search, and retrieve these embeddings efficiently, vector databases have become an essential component of modern AI architectures.
Let us now explore the role of vector databases and how they power scalable, high-performance retrieval systems.
在进一步探讨向量数据库之前,首先需要了解它们在其他类型数据库中的定位。
Before we explore vector databases further, it is important to first understand where they fit among other types of databases.
让我们来看一下数据库的类型,它们如下:
Let us look at the types of databases, which are as follows:
其中,向量数据库已成为人工智能、检索和智能推理系统不可或缺的一部分。
Among these, vector databases have become essential for AI, retrieval, and agentic reasoning systems.
向量数据库旨在存储和检索稠密向量嵌入,即文本、图像或音频等非结构化数据的数值表示。与关系型数据库或文档型数据库不同,向量数据库基于余弦相似度或欧氏距离等距离度量进行相似性搜索,而不是进行精确匹配。
Vector databases are designed to store and retrieve dense vector embeddings, numerical representations of unstructured data such as text, images, or audio. Unlike relational or document databases, vector databases perform similarity searches based on distance metrics like cosine similarity or Euclidean distance rather than exact matching.
它们使 AI 模型能够高效地检索语义相似的项目,这是 RAG 和内存增强型代理系统的关键操作。
They enable AI models to retrieve semantically similar items efficiently, a critical operation for RAG and memory-augmented agentic systems.
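The distance metrics themselves are simple to state. The following NumPy sketch shows cosine similarity and Euclidean distance over toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Smaller values mean the vectors are closer together.
    return float(np.linalg.norm(a - b))

query = np.array([0.1, 0.9, 0.2])
doc = np.array([0.2, 0.8, 0.1])
print(cosine_similarity(query, doc))
print(euclidean_distance(query, doc))
```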
Efficient vector search is fundamental to AI-driven applications, but performing brute-force searches over millions of high-dimensional vectors is computationally intensive. To address this challenge, vector databases use specialized indexing algorithms that improve search speed while balancing accuracy and memory efficiency.
The following are some commonly used indexing techniques:
Each of these indexing methods offers a different balance between speed, accuracy, and memory usage. The best choice depends on the specific needs of the application, including dataset size, performance requirements, and infrastructure constraints.
Vector databases are designed to store and retrieve high-dimensional vector embeddings efficiently, key to powering modern AI applications like semantic search, recommendation systems, and image similarity. Once data is encoded into vector embeddings and indexed, search algorithms are used to retrieve the nearest neighbors to a given query vector.
The two main approaches are as follows:
Common ANN techniques include the following:
Most production-grade vector databases (like Faiss, Milvus, or Pinecone) rely on ANN search to deliver low-latency, high-throughput performance without sacrificing too much on relevance or recall.
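To make the exact-versus-approximate trade-off concrete, the following hedged sketch uses Faiss to compare a brute-force index with an HNSW index on random placeholder vectors and reports how much of the exact result the ANN search recovers:

```python
import numpy as np
import faiss

dim, n = 128, 10_000
xb = np.random.random((n, dim)).astype("float32")  # vectors to index
xq = np.random.random((5, dim)).astype("float32")  # query vectors

exact = faiss.IndexFlatL2(dim)      # brute-force, exact nearest neighbors
exact.add(xb)

ann = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph index: approximate, fast
ann.add(xb)

k = 5
_, exact_ids = exact.search(xq, k)
_, ann_ids = ann.search(xq, k)

# Fraction of the exact top-k recovered by the ANN index (first query).
print(len(set(ann_ids[0]) & set(exact_ids[0])) / k)
```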
At the core of vector databases lies the concept of embeddings. An embedding is a dense numerical vector that captures the semantic meaning of an input (text, image, audio) in such a way that similar inputs lie closer together in the vector space.
For example, two sentences about dogs will have embeddings close together even if they use different wording.
Embedding models are neural networks trained to map inputs into these vector spaces. Some popular types of embedding models are as follows:
Embedding models are crucial because the quality of retrieval depends heavily on the quality of embeddings.
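As a hedged sketch, the sentence-transformers library can be used to see this in practice; the model name below is an illustrative public checkpoint, not one prescribed by this chapter:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

sentences = [
    "The dog chased the ball in the park.",
    "A puppy played fetch on the grass.",
    "Quarterly revenue grew by twelve percent.",
]
embeddings = model.encode(sentences)

# Semantically related sentences score noticeably higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # dog vs. puppy: high
print(util.cos_sim(embeddings[0], embeddings[2]))  # dog vs. finance: low
```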
In RAG pipelines, embeddings of queries are matched against stored document embeddings inside a vector database to retrieve the most relevant knowledge for grounding responses.
In agentic systems, an agent might need to:
All of this happens dynamically at runtime, based on vector similarity rather than just keyword matching.
Thus, vector databases enable semantic memory and scalable, intelligent retrieval, two cornerstones of the new age of GenAI. The following figure represents the workflow of a similar search pipeline using embeddings, where input data is transformed into vectors to retrieve semantically similar results:
Figure 1.9: Basic flow of semantic retrieval using a vector database
While vector databases enable fast and efficient retrieval of semantically similar documents, the top results returned by similarity search are not always perfectly aligned with the user's true intent. Retrieval based purely on vector similarity can sometimes surface documents that are only loosely relevant, leading to less accurate or less grounded final outputs.
To address this challenge, an important refinement step called reranking is often introduced. Reranking allows AI systems to reorder and prioritize retrieved documents based on deeper relevance scoring, improving the quality of the inputs ultimately passed to the language model for generation.
Let us now understand reranking, why it is needed, how it works, and the different approaches used in modern AI pipelines.
The concept of reranking is not new to AI. It has deep roots in recommendation systems and search engines.
In traditional recommendation pipelines (e.g., recommending products, movies, articles), the system typically retrieves a broad set of candidates, say, the top 100 or top 1000 items, based on rough matching like user history or content similarity. However, these initial candidates are often imperfect, as retrieval systems prioritize recall, getting as many potentially good items as possible, even at the cost of precision.
Thus, a reranking step is introduced, with details as follows:
This two-stage approach, retrieval and reranking, is now fundamental not just in recommendation systems but also in modern RAG pipelines and search engines.
Bi-encoders and cross-encoders are two popular architectures used for tasks like semantic search and ranking in natural language processing (NLP).
Bi-encoders independently encode the query and document into separate vector embeddings using the same model. These embeddings can then be efficiently compared using cosine similarity or other distance metrics, making bi-encoders ideal for large-scale retrieval where speed and scalability are critical.
Cross-encoders, on the other hand, jointly encode the query and document by feeding them together into a transformer model. This allows the model to consider cross-attention between tokens, resulting in more accurate relevance scoring. However, this approach is computationally expensive and slower, limiting its use in real-time or large-scale systems.
In the context of retrieval and reranking, a common pattern is to use bi-encoders for fast candidate retrieval, followed by cross-encoders for reranking the top results to improve precision, balancing efficiency and accuracy effectively:
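The following hedged sketch shows this two-stage pattern with the sentence-transformers library; both model names are illustrative public checkpoints:

```python
# Stage 1: a bi-encoder retrieves candidates quickly.
# Stage 2: a cross-encoder rescores the top candidates jointly with the query.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my router password?"
docs = [
    "Hold the reset button for ten seconds to restore factory settings.",
    "Our routers support dual-band Wi-Fi for faster streaming.",
    "To change the admin password, open the router settings page.",
]

# Fast candidate retrieval with independent embeddings.
doc_emb = bi_encoder.encode(docs)
query_emb = bi_encoder.encode(query)
scores = util.cos_sim(query_emb, doc_emb)[0]
candidates = sorted(zip(docs, scores), key=lambda x: -float(x[1]))[:3]

# Precise reranking with joint query-document scoring.
pairs = [(query, doc) for doc, _ in candidates]
rerank_scores = cross_encoder.predict(pairs)
for (q, doc), score in sorted(zip(pairs, rerank_scores), key=lambda x: -float(x[1])):
    print(round(float(score), 3), doc)
```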
In RAG pipelines or AI search systems, the typical workflow is as follows:
Due to the cross-encoder model’s deep interactions between the query and candidate documents, it significantly improves the quality of retrieved information, leading to better-grounded, more accurate, and more contextually relevant outputs during generation.
Thus, reranking, especially using cross-encoders, is a vital tool in building high-precision, production-grade AI retrieval systems today, as shown in the following figure:
Figure 1.10: Reranking architecture for improving document relevance
While reranking improves the quality and relevance of retrieved information, it does not inherently guarantee that the AI’s final output will always be safe, unbiased, or aligned with application requirements. Even with high-precision retrieval, generation models can still hallucinate, introduce sensitive content, or produce outputs that deviate from user expectations. To address these risks, modern AI systems implement guardrails, structured controls, and validation mechanisms designed to monitor, filter, and shape model behavior. In the next section, we will explore the concept of guardrails, why they are essential, and how they are applied across retrieval and generation pipelines.
As AI systems become increasingly capable, the need for guardrails, structured controls that guide and constrain model behavior, has become critical. Guardrails ensure that models act safely, ethically, and in alignment with application or organizational goals, even when handling complex, open-ended inputs.
While reranking helps in surfacing more relevant and factual information, it does not inherently prevent hallucinations, bias propagation, policy violations, or user manipulation. LLMs are powerful but non-deterministic; even with clean inputs, they can produce unsafe, offensive, or misleading outputs if left unchecked. Guardrails help maintain trust, safety, and compliance. These are all crucial factors when deploying AI systems in real-world environments, especially in the enterprise, healthcare, finance, and education sectors. The following figure illustrates the architecture of a RAG system enhanced with guardrails, reranking, and LLM-based response generation:
Figure 1.11: End-to-end view of a guardrail-enabled GenAI system
Guardrails typically operate at two major stages, which are described in the following list:
Guardrails are implemented through a combination of techniques:
Without guardrails, AI systems are vulnerable to the following:
These risks can cause reputational damage, compliance violations, user harm, and even legal consequences for organizations.
The following leading AI platforms have recognized the need for robust guardrails and built specialized frameworks:
The Moderation API returns detailed scores indicating the likelihood of a violation, allowing developers to:
By integrating the Moderation API into production pipelines, developers ensure that models behave consistently with safety and compliance standards, without requiring constant manual monitoring.
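A hedged sketch of this integration with the official OpenAI Python client follows; the model identifier and response fields reflect the public Moderation API at the time of writing and may change across versions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input="User-submitted text to screen before generation.",
)

result = response.results[0]
if result.flagged:
    # Block, log, or route to human review based on per-category scores.
    print(result.category_scores)
else:
    print("Content passed moderation checks.")
```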
These tools show that guardrails are no longer optional; they are foundational to building responsible, production-ready AI applications.
While retrieval, reranking, and guardrails significantly enhance the reliability and safety of AI systems, true intelligent behavior requires models to go beyond single-turn responses. Modern AI applications increasingly involve agents: systems capable of autonomous reasoning, decision-making, planning, and tool use. Although we will explore agents in greater depth in Chapter 5, Implementing Agentic GenAI Systems with Human-AI Interaction, it is important to introduce the core concepts: how agents leverage tools, perform reasoning, develop plans, execute actions, maintain memory across tasks, and collaborate in multi-agent systems to solve complex goals. Understanding these foundational ideas will prepare us for building more dynamic, adaptable AI solutions in the chapters ahead.
A GenAI agent is an intelligent software system that uses generative models, such as LLMs or diffusion models, to understand, reason, and create content in response to user input or environmental stimuli. It can perform tasks like answering questions, generating text or images, summarizing content, or even collaborating in problem-solving. GenAI agents often integrate with tools or APIs and can operate autonomously or within a larger multi-agent system. They observe inputs, make decisions based on learned patterns, and take actions aligned with their goals, mimicking human-like cognition in creative and functional contexts. Refer to the following list to build a deeper understanding of agents:
Figure 1.12: Agent flow, showing how an agent interacts with the environment and takes action
In a non-agentic RAG system, the process is linear and static: a user query is embedded, a retriever fetches top-k documents, and the language model generates an answer using the retrieved context. Each step follows a fixed pipeline without dynamic decision-making. Non-agentic RAG excels in simple question-answering tasks where the initial retrieval is usually sufficient, but it struggles when retrieval results are noisy, ambiguous, or insufficient for complex reasoning.
In contrast, an agentic RAG system introduces dynamic control, reasoning, and adaptability. An agent first assesses the query, retrieves initial documents, and reasons about whether the information is sufficient. If not, the agent can reformulate the query, perform multiple retrievals, choose different tools (like search APIs or databases), reflect on intermediate results, and dynamically plan multiple steps to arrive at a better-grounded answer. Agentic RAG systems can iteratively retrieve, rerank, reason, and synthesize across multiple knowledge sources, adapting in real-time to solve complex, multi-hop, or ambiguous queries.
Thus, while non-agentic RAG is simple and fast for straightforward tasks, agentic RAG is critical for building truly intelligent, reliable systems that can handle uncertainty, incomplete data, or evolving information needs. Figure 1.13 illustrates a multi-agent system featuring an orchestration agent and two additional agents capable of performing joint tasks as well as individual tasks:
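A conceptual sketch of the agentic loop described above follows. The helper functions (embed, retrieve, is_sufficient, reformulate, generate) are hypothetical placeholders standing in for an embedding model, a vector store, and an LLM, not a specific framework's API:

```python
MAX_STEPS = 3  # cap on retrieval attempts to keep latency bounded

def agentic_rag(query: str) -> str:
    current_query = query
    docs = []
    for _ in range(MAX_STEPS):
        docs = retrieve(embed(current_query), top_k=5)  # hypothetical vector search
        if is_sufficient(current_query, docs):          # agent reflects on coverage
            break
        # Insufficient context: reformulate the query and retrieve again.
        current_query = reformulate(current_query, docs)
    # Ground the final answer in whatever context was gathered.
    return generate(query, docs)                        # hypothetical LLM call
```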
Figure 1.13: Multi-agent systems with an orchestration agent
Agentic systems empower AI models to reason, plan, and execute tasks autonomously by dynamically using tools, APIs, and external knowledge sources. However, without a standardized way to discover and interact with these tools, scaling becomes chaotic and fragile. This is where the MCP is essential. MCP provides a universal, language-agnostic interface for agents to seamlessly access tools, data, and prompts, ensuring secure, modular, and dynamic integration.
MCP is an open standard designed to simplify and standardize how AI models interact with external tools, data sources, and APIs. Introduced by Anthropic, MCP acts as a universal communication layer, much like a USB-C for AI, enabling AI assistants and agents to seamlessly retrieve structured information, invoke actions, or apply domain-specific prompts without custom integrations for every backend system.
At its core, MCP establishes a client-server architecture where servers expose three primitives: tools (functions that perform actions), resources (data like documents or APIs), and prompts (guidance for AI behavior). MCP uses lightweight, language-agnostic protocols like JSON-RPC over transports such as stdio or HTTP/SSE, making it easy to integrate across diverse environments.
By adopting MCP, developers can build scalable AI systems where new tools and data sources can be dynamically discovered and utilized without retraining models or hardcoding APIs. MCP also ensures modularity, security, and future-proofing, critical for sectors like healthcare, finance, and enterprise automation. As AI ecosystems grow increasingly complex, MCP provides a foundation for building interoperable, secure, and agile AI systems that can reason and act across multiple domains through a unified interface.
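Because MCP is built on JSON-RPC 2.0, its wire format is easy to inspect. The following hedged sketch constructs the tools/list and tools/call requests named in the MCP specification; the tool name and arguments shown are hypothetical:

```python
import json

# Ask the server which tools it exposes.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Invoke one of those tools with arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_orders",               # hypothetical tool name
        "arguments": {"customer_id": "C-42"},  # hypothetical arguments
    },
}

# Over the stdio transport, each message is written as one line of JSON.
print(json.dumps(list_tools))
print(json.dumps(call_tool))
```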
Together, agentic systems and MCP enable AI to operate more intelligently and reliably, adapting to real-world complexities without hardcoded dependencies, unlocking powerful applications across industries like healthcare, finance, and education. Figure 1.14 shows how MCP establishes a client-server architecture and interacts with external tools, data sources, and APIs:
Figure 1.14: MCP establishes a client-server architecture and interacts with external tools, data sources, and APIs
In this chapter, we laid the foundation for understanding how modern GenAI systems are designed and orchestrated. We began by differentiating between retrieval systems and generation systems, exploring how each plays a critical role in building intelligent AI solutions. We discussed the evolution from traditional keyword-based retrieval to dense vector search powered by embeddings, and how vector databases enable scalable, real-time semantic retrieval. Moving beyond basic retrieval, we introduced reranking techniques, particularly the use of cross-encoders, to refine and prioritize retrieved documents for greater relevance and precision. We then emphasized the importance of guardrails to ensure AI outputs are safe, ethical, and aligned with real-world usage standards. Finally, we introduced the emerging world of agentic AI systems, covering key concepts such as tool use, reasoning, planning, action, memory, and multi-agent collaboration.
In the next chapter, we explore the expanding frontier of multimodal systems, where AI applications are no longer limited to a single mode of input or output. The focus then shifts to multimodal GenAI architectures, where text, images, and structured data interact within unified frameworks. Readers will learn how AI systems transform text into images, interpret images into descriptions, combine inputs for new outputs, and even translate natural language into Structured Query Language (SQL). This sets the foundation for building rich, contextually aware AI experiences.
The first chapter introduced the foundations of modern generative AI (GenAI), covering retrieval systems, generation models, retrieval-augmented generation (RAG), orchestration, tokenization, vector databases, reranking, guardrails, agent systems, and the Model Context Protocol (MCP). These core components established the groundwork for building intelligent, text-driven generative systems.
Building on this foundation, this chapter explores the evolution of AI into multimodal domains, where text, images, and other data types are processed together. We begin by explaining cross-encoders and bi-encoders within the context of vision-language models (VLMs), followed by a discussion on multimodal vector embeddings and the design of multimodal vector databases.
The chapter further clarifies how VLMs differ from broader multimodal GenAI systems. Practical applications, including text-to-image generation, image-to-text captioning, text and image-to-image synthesis, and text-driven specification and image generation, are covered. Additionally, we explore how text-to-SQL query generation expands the potential of multimodal AI systems.
Through this chapter, we move from understanding the basic mechanisms of generative models to developing systems capable of sophisticated, cross-modal reasoning, positioning us for advanced applications in real-world environments.
In this chapter, we will learn about the following topics:
This chapter aims to equip readers with a comprehensive understanding of the key building blocks essential for designing and deploying modern GenAI systems. By mastering concepts such as retrieval and generation systems, vector databases, embedding techniques, advanced prompting strategies, agentic architectures, and multi-agent collaboration, readers will gain a strong foundation for building intelligent, scalable AI solutions. Additionally, the chapter introduces critical topics like local model deployment, graphics processing unit (GPU) infrastructure, speech processing, memory management in agents, and industry standards like MCP. These foundational elements are crucial for advancing toward multimodal, reliable, and production-ready AI applications.
VLMs form the foundation of multimodal AI systems that bridge the gap between visual and textual understanding. Unlike traditional GenAI, which primarily processes text or images in isolation, VLMs are designed to jointly interpret, align, and generate across both modalities. As organizations increasingly look to create systems that see and talk, VLMs have become critical for applications such as visual question answering (VQA), captioning, cross-modal retrieval, and even text-driven image generation.
Building on the foundations laid in the previous chapter, where we discussed core generative and retrieval concepts, this section delves into the architecture, types, and capabilities of VLMs, highlighting how they extend the principles of tokenization, vector embeddings, retrieval, and generation across multiple data forms.
VLMs are powerful AI systems that integrate visual and textual understanding, enabling machines to process, interpret, and generate information across both modalities. As the field evolves, VLMs are increasingly specialized in serving diverse applications, from retrieving the right image for a search query to generating detailed image descriptions or even reasoning over documents. To better understand their capabilities and design, VLMs can be broadly categorized based on their core objectives: retrieval, captioning and QA, generative synthesis, multimodal reasoning, and instruction tuning. Each category reflects a unique architectural focus and supports real-world applications across industries like e-commerce, accessibility, education, and design. VLMs can be broadly classified based on their primary tasks and architectural design goals, which are as follows:
Despite task differences, most VLMs share common design principles, like the following:
These two sequences are then concatenated and passed into a transformer that performs self-attention across the entire joint sequence, meaning:
This full cross-attention allows rich, fine-grained interactions between vision and language representations at every layer. For example, a word like dog can attend to specific image patches showing the dog, and vice versa.
Figure 2.1 depicts a cross-encoder-based VLM architecture designed for tasks requiring deep joint understanding of images and their corresponding textual descriptions, such as matching product photos with detailed specifications. Unlike dual encoders that generate separate embeddings for images and text, this approach does not rely on embeddings but instead computes a direct relevance score by processing the input as a combined pair. Through merged attention or cross-attention mechanisms, the model captures fine-grained interactions across modalities. This setup is ideal for scenarios where alignment precision is critical, such as e-commerce product verification, VQA, and multimodal document understanding.
Figure 2.1: Cross-encoders jointly process both image and text
The image and text are encoded separately into their own embeddings using two independent encoders: a vision encoder (e.g., a ViT or CNN) processes the image and produces an image embedding vector, while a text encoder (e.g., a transformer-based language model like BERT) processes the text and produces a text embedding vector.
Importantly, the image and text do not interact during encoding.
There is no cross-attention between image and text features during inference.
Once both embeddings are generated independently, they are compared after encoding, often by computing a similarity score such as:
The following figure depicts the idea that the closer two embeddings are in the vector space, the more relevant they are considered to each other:
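To make the dual-encoder comparison concrete, the following hedged sketch scores an image against two captions with CLIP via the transformers library; the checkpoint name is an illustrative public model, and dog.jpg is a hypothetical local file:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical local image file
texts = ["a photo of a dog", "a photo of a car"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # The two encoders run independently; no cross-attention between modalities.
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize and compare the independently produced embeddings.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # higher cosine score = more relevant pair
```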
Fusion mechanisms in VLMs refer to how information from different modalities, typically visual features from images and textual features from language, is combined to form a joint representation. Effective fusion is crucial for enabling models to reason across both vision and text inputs.
There are several types of fusion strategies. Early fusion combines image and text embeddings at the input stage, allowing the model to jointly learn cross-modal interactions from the beginning. Late fusion processes each modality separately and merges their output at a later stage, typically before final decision-making. Intermediate fusion (or cross-modal fusion) combines features after partial processing, allowing for more sophisticated interactions between modalities during the model's forward pass. Fusion mechanisms are often implemented using cross-attention layers, where features from one modality (e.g., image regions) attend to features from the other modality (e.g., text tokens). This is similar to how transformers use attention to relate different parts of a sequence, but here the attention operates across modalities. Cross-attention enables models to selectively focus on relevant parts of an image when processing text and vice versa.
Thus, while fusion mechanisms refer broadly to the combining of modalities, cross-attention is a specific technique often used within fusion strategies.
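A minimal PyTorch sketch of cross-attention as a fusion mechanism follows; shapes and dimensions are illustrative, with text tokens attending over a grid of image patch features:

```python
import torch
import torch.nn as nn

dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

text_feats = torch.randn(1, 12, dim)   # 12 text tokens (illustrative)
image_feats = torch.randn(1, 49, dim)  # 7x7 grid of image patches (illustrative)

# Each text token selectively attends to relevant image patches.
fused, attn_weights = cross_attn(query=text_feats, key=image_feats, value=image_feats)
print(fused.shape)         # (1, 12, 256): image-informed text features
print(attn_weights.shape)  # (1, 12, 49): attention over patches per text token
```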
VLMs represent a critical evolution in AI, merging the strengths of computer vision and natural language processing (NLP) into unified, powerful architectures. From retrieval and captioning to multimodal reasoning and instruction-tuning, VLMs are paving the way for the next generation of intelligent systems capable of interacting with the world through multiple senses. As we proceed deeper into multimodal GenAI systems in the next sections, the capabilities and limitations of VLMs provide a vital reference point, highlighting the opportunities and challenges of building truly versatile, human-like AI.
Despite their growing success across tasks such as VQA, image captioning, and cross-modal retrieval, VLMs face several critical challenges that limit their broader applicability and real-world deployment. These challenges stem from both architectural limitations and practical constraints in data, performance, and generalization:
Overcoming these challenges requires innovations in data curation, model architecture, training efficiency, and integration with retrieval or orchestration systems, many of which are addressed in broader multimodal GenAI frameworks.
Training VLMs is an extremely resource-intensive process. These models require millions or even billions of aligned image-text pairs to learn meaningful multimodal representations. Curating such massive datasets involves substantial effort, including data collection, cleaning, filtering for quality, and sometimes human labeling to ensure proper alignment. Beyond data, computational costs are also very high. VLMs typically use large architectures, such as ViT for images and transformer-based encoders for text. Training them from scratch demands extensive GPU or TPU clusters running for weeks or even months. For example, models like CLIP (OpenAI) and ALIGN (Google) were trained on datasets that regular organizations cannot easily replicate due to hardware, storage, and energy costs. Moreover, achieving good generalization requires diverse and broad datasets, covering a wide range of visual and textual concepts, further increasing data acquisition challenges. Fine-tuning a pretrained VLM is more feasible for most organizations, but even that can be expensive if large-scale domain adaptation is needed.
Therefore, while developing a VLM from scratch offers full control and potential innovation, it is often prohibitively expensive. Many practical systems today rely on fine-tuning or adapting open-source pretrained VLMs instead of training entirely new models. An alternative to training large VLMs from scratch is building multimodal RAG systems. In multimodal RAG, as shown in Figure 2.3, separate retrievers fetch relevant text, image, or mixed-modal data from external sources, and a generator synthesizes a response based on the retrieved information. This approach bypasses the need for massive pretraining by leveraging existing multimodal embeddings and vector databases. It allows flexible integration of text, images, or both as context for downstream tasks like QA, captioning, or summarization, making it a more efficient and scalable method for deploying multimodal AI systems without the heavy costs of end-to-end training.
Figure 2.3: Multimodal RAG system, using two embedding models, one for text and one for image
Let us understand the multimodal RAG system, an efficient way to build multimodal AI capabilities without training large VLMs from scratch. This system intelligently retrieves and generates answers by leveraging both text and image data. The following is a detailed step-by-step explanation of the process:
The query can be:
The system must handle different modalities and interpret them appropriately.
By encoding both modalities into a shared embedding space, the system ensures that similar concepts from text and images can be compared meaningfully.
This step ensures that all knowledge assets—text and images—are searchable through vector similarity.
This search step ensures that the most relevant knowledge pieces, irrespective of modality, are fetched.
This design ensures that the model does not hallucinate answers but grounds them in actual retrieved knowledge.
Thus, users receive high-quality responses generated efficiently through retrieval and augmentation.
This multimodal RAG architecture efficiently merges text and image retrieval with generative capabilities. It bypasses the need for massive VLM pretraining, reduces computational costs, and enables scalable deployment of multimodal systems. By separating retrieval and generation, organizations can build powerful AI solutions with existing embedding models and LLMs, making it an attractive option for real-world multimodal AI applications.
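The overall flow can be summarized in a conceptual sketch. The encoders, vector store, and LLM below are hypothetical placeholders standing in for components such as CLIP embeddings, a Qdrant index, and a hosted language model:

```python
def multimodal_rag(query_text=None, query_image=None, top_k=5):
    # 1. Encode the query into the shared embedding space.
    if query_image is not None:
        query_vec = image_encoder(query_image)  # hypothetical image encoder
    else:
        query_vec = text_encoder(query_text)    # hypothetical text encoder

    # 2. Retrieve the nearest text and image chunks by vector similarity.
    hits = vector_store.search(query_vec, top_k=top_k)  # hypothetical store

    # 3. Assemble retrieved context and ground the LLM's answer in it.
    context = "\n".join(hit.payload["content"] for hit in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query_text}"
    return llm.generate(prompt)                 # hypothetical LLM call
```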
Now you know that in the era of GenAI, the ability to work across multiple modalities, text, images, audio, video, and structured data, is no longer a luxury but a necessity. Multimodal RAG systems are at the forefront of this evolution, enabling more context-rich, informative, and human-like responses by augmenting LLMs with relevant information retrieved from diverse data sources. However, the effectiveness of such systems is heavily dependent on their underlying vector representations, specifically, the ability to generate multimodal vector embeddings that unify information across formats in a comparable, semantically rich space.
Multimodal vector embeddings are essential because they form the backbone of similarity search in a RAG pipeline. A standard text-only RAG system may suffice for applications limited to documents, webpages, or textual knowledge bases. However, real-world information is often multimodal. For example, user manuals contain both text and diagrams; product specifications include tabular data and annotated images; customer support interactions may involve voice transcripts and screenshots. A system that cannot simultaneously understand and retrieve relevant information from these heterogeneous formats will miss critical signals, leading to suboptimal generation quality.
To enable cross-modal retrieval, each piece of content, whether it is an image, paragraph, or audio clip, must be embedded into a vector space. However, unlike unimodal systems, where all embeddings are derived from the same encoder and live in a uniform latent space, multimodal systems require a more sophisticated design. As explained in Figure 2.2, separate encoders (e.g., CLIP for images, Sentence Transformers for text, and Whisper for audio) are often used to generate modality-specific embeddings. These embeddings must then either be mapped into a shared latent space or linked via indexing strategies that allow for efficient similarity computation across modalities.
For example, consider a user asking, show me laptops with ports like this, while uploading an image of a laptop side profile. A unimodal RAG system would fail to interpret the image. In contrast, a multimodal RAG system with joint vector embeddings can match the image to similar laptop port diagrams stored in the database and retrieve corresponding product specifications and reviews. This retrieval is only possible because the visual and textual information are both represented as vectors in a shared or aligned space that preserves semantic meaning.
Multimodal vector embeddings also enhance the flexibility of query formulation. Users can input images, text, or even a combination of both, and the system can match them against relevant documents, diagrams, or knowledge chunks. This makes the system more intuitive and inclusive, bridging language barriers and accommodating users who may not have the precise keywords but possess visual or auditory cues.
Furthermore, in RAG systems designed for high-stakes domains like healthcare, legal, or manufacturing, the use of multimodal embeddings ensures a more comprehensive evidence base for answer generation. It reduces the risk of hallucinations by anchoring the generation to real, multimodal data artifacts rather than relying purely on prior model knowledge.
Once multimodal vector embeddings are generated, representing text, images, or both in a shared semantic space, they must be efficiently stored and retrieved to support real-time AI applications. This is where a multimodal vector database becomes essential. It provides a structured, high-performance storage system optimized for similarity search across embeddings from different modalities. By organizing these embeddings alongside metadata (e.g., language, timestamp), the vector database enables fast, filtered approximate nearest neighbor (ANN) retrieval. This transition from embeddings to a vector database is crucial for powering scalable, cross-modal systems such as multimodal RAG, recommendation engines, and semantic search platforms.
Examples:
There are some critical design choices to consider when using a multimodal vector database; let us take Qdrant as an example of a vector database for storing and retrieving high-dimensional multimodal embeddings.
Let us understand a few key concepts specifically in the context of Qdrant, a popular vector database. While most vector databases operate on similar principles, detailing each one individually is beyond the scope of this chapter and book.
A collection is the fundamental organizational unit in Qdrant. It is essentially a labeled group of data points that share a common structure. Each point in a collection is associated with a vector of a fixed size and is compared using a specific similarity metric (e.g., cosine, dot product, Euclidean). All vectors in the same collection must adhere to this uniform dimensionality and distance function. Qdrant also allows multiple vectors to be stored under different names within a single point called named vectors, which can individually follow different metric and dimension settings.
In Qdrant, a point is an individual entry within a collection. It comprises the following:
These points are the basic units that users search against using vector similarity. The point ID is used to retrieve, update, or delete specific records. All point-related operations, including insertions or updates, are first logged to ensure durability and recovery, even in the event of power failure.
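A hedged sketch with the qdrant-client library ties these pieces together: it creates a collection with named vectors for text and image embeddings, then upserts a single point with a payload. The collection name, vector sizes, and payload fields are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # in-memory instance for experimentation

# One collection, two named vectors with their own sizes and metrics.
client.create_collection(
    collection_name="retail_products",
    vectors_config={
        "text": VectorParams(size=384, distance=Distance.COSINE),
        "image": VectorParams(size=512, distance=Distance.COSINE),
    },
)

# A point: unique ID, named vectors, and a JSON-style payload.
client.upsert(
    collection_name="retail_products",
    points=[
        PointStruct(
            id=1,
            vector={"text": [0.1] * 384, "image": [0.2] * 512},  # placeholder embeddings
            payload={"language": "en", "category": "laptop", "color": "blue"},
        )
    ],
)
```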
Vectors (also known as embeddings) represent the encoded numerical form of various data types, such as images, text, or audio. These vectors enable the comparison of different data objects in high-dimensional space. The closer two vectors are in this space, the more similar their original objects are considered to be. To generate these embeddings, one typically uses a neural network trained to learn meaningful patterns, often based on contrastive learning from labeled or weakly labeled data. Vectors are the cornerstone of similarity search and are used in clustering, ranking, and retrieval tasks.
The payload refers to additional metadata stored alongside each vector. This metadata is flexible and can take any JSON-compatible structure. It can describe attributes like language, timestamp, user information, category, or any domain-specific tags. Payloads allow Qdrant to perform filtered searches, letting users restrict similarity searches to vectors with certain metadata properties. For example, retrieving only English-language documents or filtering by date.
Qdrant organizes its data into segments within each collection. Each segment maintains its own set of vectors, payloads, and indexes. Segments are optimized for different use cases, like the following:
Qdrant supports two storage models, which are as follows:
This architecture ensures that performance and cost can be tuned based on application requirements.
In a real-world multimodal GenAI system, efficient data management and retrieval are essential for delivering fast, accurate responses across image and text modalities. Qdrant, a high-performance vector database, enables this by combining vector indexing and payload filtering, ensuring both semantic similarity and structured metadata constraints are handled seamlessly. By leveraging collections, point-level metadata (payloads), and high-dimensional embeddings from models like CLIP or BLIP, Qdrant facilitates hybrid search—retrieving relevant items based on meaning and filters like product category or color. These indexing strategies, including hierarchical navigable small world (HNSW) and payload indexes, ensure GenAI applications scale reliably while maintaining low-latency performance. Qdrant supports both vector indexing and payload (filter) indexing, allowing efficient hybrid search:
While indexing improves speed and accuracy, it incurs additional memory and processing costs. Users can selectively configure which fields should be indexed based on their expected query patterns and cardinality. Index parameters are defined at the collection level, but the actual index presence in segments depends on optimization rules and data distribution.
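Continuing the client from the previous sketch, payload indexes can be created per field; the field names here are illustrative, and the datetime schema assumes a reasonably recent Qdrant version:

```python
# Exact-match index for a categorical field used in filters.
client.create_payload_index(
    collection_name="retail_products",
    field_name="color",
    field_schema="keyword",
)

# Range-filter index for time-based queries (assumes datetime index support).
client.create_payload_index(
    collection_name="retail_products",
    field_name="timestamp",
    field_schema="datetime",
)
```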
Let us integrate all the Qdrant concepts you learned about directly into the context of building a multimodal GenAI system.
In a practical multimodal GenAI system, managing and retrieving data across modalities, like text and images, is not just about embedding vectors; it is about organizing, filtering, and retrieving them efficiently at scale. This is where concepts like collections, points, vectors, payloads, and indexes, as implemented in vector databases such as Qdrant, become critically important.
At the core of such a system is the vector embedding process. For each input data type, such as a product image or its description, a neural network model (e.g., CLIP or BLIP) converts the input into a high-dimensional vector. These vectors capture semantic meaning, so a caption like "a red sports car" and an image of a red sports car will generate embeddings that lie close to each other in the vector space. These embeddings are then grouped into collections, each representing a logical dataset segment. For example, a single collection may store all vectors related to retail product data, with images and text stored as named vectors under each point.
Each point within this collection represents an individual item, say, a product instance, and is assigned a unique point ID. Alongside the vector(s), a point can include a payload, which stores useful metadata such as language, timestamp, product category, or even the original file source. In a multimodal GenAI setup, this payload becomes crucial when we want to filter results by modality, time range, or other criteria during retrieval.
When a user inputs a query, perhaps a product photo with a textual request like "show similar models available in blue", the system needs to perform a hybrid search. This means retrieving results not only based on vector similarity but also using constraints defined in the payload (e.g., color = "blue"). To enable this, Qdrant supports payload indexing, which allows fast filtering across structured metadata fields, much like indexes in traditional relational databases.
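Continuing the same sketch, a hybrid query in qdrant-client combines a named-vector search with a payload filter; the query embedding below is a placeholder:

```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

hits = client.search(
    collection_name="retail_products",
    query_vector=("image", [0.2] * 512),  # named vector plus placeholder embedding
    query_filter=Filter(
        must=[FieldCondition(key="color", match=MatchValue(value="blue"))]
    ),
    limit=5,
)
for hit in hits:
    print(hit.id, hit.score, hit.payload.get("category"))
```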
Behind the scenes, the collection is divided into segments, each with its own storage and indexing configuration. Depending on performance requirements, these segments may use in-memory storage for maximum speed or memory-mapped storage to optimize RAM usage while still enabling fast access via the OS-level page cache. For a real-world GenAI application that serves millions of users, segmenting storage this way ensures both scalability and fault tolerance.
Finally, to accelerate vector retrieval, Qdrant supports high-performance vector indexes (e.g., HNSW) that allow the system to quickly approximate the nearest neighbors in high-dimensional space without brute-force comparison. Combined with payload filters, this indexing strategy enables ANN retrieval with precise control, which is vital for real-time multimodal systems.
Two common strategies emerge when storing and searching high-dimensional multimodal embeddings: using a single collection with filters vs. creating multiple collections with localized indexing. These design decisions are especially important when embeddings are updated frequently, such as every day in a production pipeline, and when queries require fine-grained control.
In this approach, all vector embeddings, whether derived from images, text, or multimodal documents, are stored in a single, unified collection. The differentiation between data types, such as dates or languages, is handled via payload metadata. For example, each point might carry tags like {"date": "2025-05-09", "language": "en"}, which are then used as filters during query execution.
This setup is simple and scalable. There is only one collection to maintain, and all embeddings are searchable in a single vector space. Operationally, it is cost-efficient and easy to integrate with downstream systems. However, because no global index is built across subsets (e.g., by date or language), the ANN retrieval accuracy is significantly lower, dropping to around a 50% match rate compared to exact KNN searches.
Filtering embeddings purely through payload without global indexing introduces inefficiencies, especially when the dataset grows or becomes skewed across time and classes. For example, if one date contains disproportionately more data or a specific language dominates, ANN search may lose precision due to uneven vector distribution across the filtered subsets. The following figure depicts a single collection, partitioned via a payload approach:
Use case: This method is best for environments where ease of maintenance and cost control are prioritized over perfect retrieval accuracy, such as in non-critical retrieval tasks or early-stage prototypes.
The second strategy opts for separating collections by date, creating one collection per day (e.g., embeddings_2025_05_08, embeddings_2025_05_09). Within each collection, a global vector index (e.g., HNSW) is explicitly built to enable highly optimized ANN retrieval. Each collection can then be partitioned further by language using payload filters.
This approach results in significantly higher precision during ANN-based searches—up to 98% match rate compared to exact KNN—because each collection benefits from localized indexing and a more homogeneous embedding distribution. By narrowing the search space to a single date and filtering only within that segment, the system avoids the dilution of vector clusters that occurs in large, global collections.
However, this model comes at a cost. Maintaining multiple collections increases operational complexity, and the system must manage the indexing cost for each new collection. Additionally, scaling to many collections over time (e.g., per hour or per user) may lead to resource inefficiency and storage overhead.
Use case: This model is ideal when high-accuracy recommendations or precise semantic search are required, such as in product recommendation engines, personalized assistants, or critical analytics pipelines.
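A brief sketch of this per-date strategy, reusing the qdrant-client setup from the earlier examples, might create and query day-scoped collections as follows; the naming scheme mirrors the collection names mentioned above:

```python
from datetime import date
from qdrant_client.models import Distance, VectorParams

day = date(2025, 5, 9)
collection = f"embeddings_{day:%Y_%m_%d}"  # e.g., embeddings_2025_05_09

client.create_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Queries target only that day's collection, keeping the ANN search space
# small and the embedding distribution homogeneous.
hits = client.search(collection_name=collection, query_vector=[0.1] * 384, limit=5)
```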
The following figure depicts multiple collections with a global indexing approach:
Having explored how multimodal vector embeddings are organized, stored, and retrieved using vector databases, we now shift our focus to how these capabilities are applied in end-to-end AI systems. While vector databases serve as the backbone for efficient storage and search across modalities, the architectural choices made on top of this infrastructure, particularly whether to use a multimodal GenAI system or a VLM, can significantly influence performance, scalability, and application fit. In the following section, we examine the fundamental differences between these two approaches and look at when each should be used.
随着人工智能的演进,对能够理解和生成多种数据模态(例如文本、图像、音频或结构化数据)的系统的需求显著增长。两种方法已脱颖而出,成为这一发展趋势的前沿:虚拟语言模型(VLM)和更广泛的多模态生成人工智能(GenAI)系统。虽然这两个术语有时可以互换使用,但它们用途不同,架构原则也不同。本节将阐明它们的区别,并就何时最适合应用每种方法提供指导。
As AI evolves, the demand for systems that can understand and generate across multiple data modalities, such as text, images, audio, or structured data, has grown significantly. Two approaches have emerged at the forefront of this advancement: VLMs and broader multimodal GenAI systems. While the terms are sometimes used interchangeably, they serve distinct purposes and operate under different architectural principles. This section clarifies their differences and offers guidance on when each is best applied.
视觉语言模型(VLM)是多模态人工智能系统的一个子集,专门用于整合视觉和文本模态。这些模型经过训练,能够理解图像特征并与语言特征相匹配,从而实现图像描述、视觉问答、图像文本检索和跨模态推理等任务。
VLMs are a subset of multimodal AI systems that specifically integrate visual and textual modalities. These models are trained to understand and align image features with language features, enabling tasks such as image captioning, VQA, image-text retrieval, and cross-modal reasoning.
视觉学习模型(VLM)通常采用融合两种独立神经编码器嵌入的架构:视觉编码器(例如 ViT、ResNet)和文本编码器(例如 BERT、RoBERTa 或 GPT)。融合后的表示使模型能够跨模态进行推理。一些模型使用交叉注意力机制,使图像标记能够关注文本标记,反之亦然;而另一些模型则使用对比学习(例如 CLIP、ALIGN)将图像和文本映射到共享的潜在空间以进行检索。
VLMs are typically built using architectures that fuse embeddings from two separate neural encoders: a vision encoder (e.g., ViT, ResNet) and a text encoder (e.g., BERT, RoBERTa, or GPT). The fused representation allows the model to reason across both modalities. Some models use cross-attention mechanisms to allow image tokens to attend to text tokens, and vice versa, while others use contrastive learning (e.g., CLIP, ALIGN) to map images and texts into a shared latent space for retrieval.
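As a hedged illustration of the contrastive approach, the following sketch uses the CLIP checkpoint available through the sentence-transformers library to embed an image and two captions into the same latent space; the model name and image path are assumptions:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Image and text land in the same latent space, so they can be compared.
img_emb = model.encode(Image.open("cat.jpg"))
txt_emb = model.encode(["a photo of a cat",
                        "a diagram of a network"])

print(util.cos_sim(img_emb, txt_emb))  # higher score = better image-text match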
The following are examples of VLMs:

VLMs are often pretrained on large image-text datasets, using self-supervised or semi-supervised learning objectives, and can be fine-tuned on downstream tasks requiring vision-language alignment.

In contrast, multimodal GenAI systems are designed to operate across multiple, and often arbitrary, modalities, not limited to vision and language. These systems combine components for retrieval, reasoning, and generation, often orchestrated through modular architectures.

A key difference is the retrieval-augmented architecture often used in multimodal GenAI systems. Instead of relying solely on a single pretrained model, these systems:

Multimodal GenAI systems can incorporate VLMs as subcomponents but are not limited to them. They often support the following:

These systems are typically pipeline-based, combining different models and retrieval layers to perform a task. RAG, orchestration layers (like LangChain or LangGraph), and tool use (via agents) are common.

Let us look at the architectural differences at a glance:
| Feature | VLMs | Multimodal GenAI systems |
| Modalities supported | Vision + text only | Any modality (text, image, audio, video, tables) |
| Model structure | End-to-end unified transformer | Modular pipeline with separate retrievers and generators |
| Typical use case | Captioning, VQA, retrieval | Multimodal chat, document analysis, RAG, complex workflows |
| Data sources | Pretrained on image-text pairs | Integrates with DBs, APIs, tools, and memory |
| Retrieval layer | Not always present | Integral part of the architecture |
| Flexibility and customization | Moderate | High |
| Use of agents or orchestration | Rare | Common (LangChain, LlamaIndex, etc.) |
| Scalability | Limited by model size | Scalable with retrieval and modularity |
Table 2.1: Comparison of VLMs vs. multimodal GenAI systems
You should use VLMs in the following cases:

In short, VLMs are ideal for controlled vision-language tasks that benefit from deep cross-modal representation learning in a single model.

Use multimodal GenAI systems when:

In essence, multimodal GenAI is suited for enterprise-grade, multi-purpose applications that demand high adaptability, live data integration, and sophisticated orchestration of AI components.

Let us take the task of answering a question based on a product manual that includes both text and figures.

This end-to-end capability across text, image, layout, and logic is the hallmark of multimodal GenAI.

Multimodal GenAI systems are distinguished not only by their ability to process diverse types of input, such as text, images, audio, or structured data, but also by the variety of outputs they are capable of producing. As organizations deploy AI systems across sectors like e-commerce, healthcare, software development, and knowledge management, it becomes important to classify multimodal systems based on the nature of their output. This classification allows for better architectural design, model selection, and alignment with downstream use cases.

This section introduces a framework for classifying multimodal systems based on the type of output they generate, focusing on six core categories:
Each of these categories reflects a unique generation pathway with its own models, challenges, and applications.

Text-to-image generation is a breakthrough capability in multimodal AI, enabling systems to transform natural language prompts into vivid, contextually accurate images. At the heart of this process are powerful generative models like DALL·E 2, Stable Diffusion, Imagen, and Parti, which learn complex mappings between textual semantics and visual features. These systems typically combine transformer-based text encoders with diffusion or autoregressive decoders, sometimes enhanced by super-resolution modules. Applications span creative design, advertising, entertainment, and personalized media. Despite their promise, challenges remain in prompt-image alignment, texture fidelity, and mitigating biases, highlighting ongoing research efforts to improve realism, controllability, and fairness in generated outputs.

Text-to-image generation refers to the process of generating a visual representation (an image) based solely on a natural language description. These systems translate descriptive input into detailed and context-aware images using powerful generative models:

These models use either diffusion techniques or transformer-based architectures to learn mappings between semantic textual inputs and visual outputs.
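For a sense of what this looks like in practice, the following is a minimal, hedged example of text-to-image generation with the Hugging Face diffusers library; the model ID, prompt, and the assumption of an available CUDA GPU are illustrative:

import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (assumed model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA GPU is available

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")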
Image-to-text generation systems empower machines to interpret and describe visual content using natural language, bridging the gap between vision and language. These systems go beyond basic captioning to deliver rich summaries or structured insights from complex visuals like charts, scenes, or diagrams. Powered by models such as BLIP, MiniGPT-4, and Flamingo, they combine vision encoders with language decoders to generate coherent text from images. Trained on curated or self-supervised datasets, these models support applications in accessibility, content management, and VQA:

These models can be trained using supervised datasets like COCO or self-supervised image-caption pairs.
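As a hedged sketch of this class of system, the snippet below captions an image with BLIP via the transformers library; the checkpoint name and image path are assumptions:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Vision encoder reads the image; the language decoder writes the caption.
image = Image.open("chart.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))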
This class of multimodal systems takes both text and image as input and produces a modified or synthesized image as output. These models often perform guided generation or editing based on the input prompt.

Text and image systems represent an advanced category of multimodal AI where both visual and textual inputs are used to guide image generation or editing. Unlike traditional text-to-image models, these systems condition outputs on an existing image and a descriptive prompt, enabling fine-grained control over modifications. Models like InstructPix2Pix, ControlNet, and Paint by Text leverage dual encoders to extract and merge visual and linguistic features, producing context-aware visual outputs. Applications range from intelligent photo editing and visual personalization to design prototyping. However, challenges persist in balancing prompt fidelity with image integrity: ensuring structural consistency, object preservation, and realistic transformations without over-altering the source image. Let us understand it in detail:

These models extend basic text-to-image pipelines by incorporating reference images or conditioning mechanisms.

These systems take a text-based prompt, often descriptive or functional in nature, and generate both a structured specification (e.g., a bill of materials, layout plan, or product blueprint) and a corresponding visual output.

This task requires a deep understanding of the intent expressed in text and the ability to generate multimodal outputs aligned with that intent.

Let us look at some example use cases:

In emerging multimodal systems, text-to-specs generation can also be combined with image architectures, pairing the precision of structured output generation with the creativity of visual synthesis. These systems interpret user prompts to produce both machine-readable specifications (e.g., JSON, YAML) and corresponding images, enabling seamless transitions from concept to design. A language model decodes intent into structured data, while a text-to-image generator visualizes the same concept, often conditioned on shared latent features to ensure alignment. This architecture is key in applications like AI-assisted design, product customization, and digital prototyping, though it faces challenges in maintaining semantic accuracy and synchronizing the dual outputs:

Text-to-SQL systems bridge natural language understanding with structured database querying by translating user queries into executable SQL statements. These systems enable intuitive data access without requiring users to know SQL syntax. In advanced multimodal configurations, the models can incorporate additional inputs, such as tables, documents, or images (e.g., scanned invoices), alongside text to generate accurate, context-aware SQL queries. Powered by models like SQLCoder and schema-constrained variants such as PICARD + T5, these systems are evaluated on benchmarks like Spider and CoSQL, pushing the boundaries of database interaction and enterprise analytics automation. Let us look at their details:

In some advanced systems, document embeddings and multimodal signals are used to dynamically guide SQL generation.
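As a simple, hedged sketch of the core idea, the snippet below prompts a local model through the Ollama HTTP API (introduced later in this book) to turn a question into SQL; the schema, question, and model name are illustrative assumptions, not a production text-to-SQL system:

import requests

schema = "TABLE sales(id INT, region TEXT, amount FLOAT, sold_on DATE)"
question = "Total sales per region in 2024"

prompt = (
    f"Given the schema: {schema}\n"
    f"Write a single SQL query answering: {question}\n"
    "Return only SQL."
)

# Ollama's local generate endpoint; any local code-capable model will do.
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "mistral",
    "prompt": prompt,
    "stream": False,
})
print(resp.json()["response"])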
Text-to-code systems enable the automatic translation of natural language instructions into executable code, streamlining software development and accelerating automation. Leveraging powerful code-focused language models like Codex, Code Llama, and StarCoder, these systems can generate anything from simple functions to full-fledged applications across programming languages. As one of the fastest-growing areas in GenAI, text-to-code technology is reshaping how developers prototype, debug, and build software. With applications in IDE integration, low-code platforms, and developer assistance, these models reduce technical barriers and boost productivity across a wide range of coding tasks. Refer to the following list to build an understanding of text-to-code systems:

These models are trained on large-scale programming corpora (e.g., GitHub) and fine-tuned for instruction-following.

In some multimodal setups, visual diagrams (e.g., flowcharts or Unified Modeling Language (UML) diagrams) can be paired with prompts to generate code that aligns with the visual logic.

Classifying multimodal systems based on output type provides clarity on system capabilities, architectural requirements, and deployment readiness. While these six classes are not exhaustive, they represent the most common production-grade use cases emerging today.

Multimodal AI systems can be categorized based on the type of output they generate and the inputs they require. This classification helps in understanding how different combinations of text and image inputs lead to varied outputs such as images, code, SQL, or structured specifications. The following table outlines key output types, their corresponding input modalities, and representative use case categories, ranging from creative design and personalization to data analytics, automation, and accessibility:
| Output type | Inputs required | Output generated | Use case category |
| Text-to-image | Text | Image | Design, marketing, creative AI |
| Image-to-text | Image | Text | Accessibility, search, indexing |
| Text + image-to-image | Text + image | Image | Guided editing, personalization |
| Text to specs + image | Text | Structured output + image | Design automation, engineering |
| Text-to-SQL | Text | SQL query | Analytics, BI, data search |
| Text-to-code | Text | Code snippet | Development, automation |
Table 2.2: Multimodal systems by output type and use
As multimodal systems continue to evolve, we can expect hybrid models that span multiple output classes. For instance, a system that reads a document (a PDF with images), retrieves database context, and produces a code snippet or SQL query is no longer hypothetical; it is already under development in enterprise AI stacks.

Designers of these systems must therefore consider output type as a primary design axis, aligning it with domain needs, user experience goals, and infrastructure capabilities.

In this chapter, we explored the architecture, classifications, and design choices central to building effective multimodal GenAI systems. We differentiated VLMs from broader multimodal GenAI pipelines, examined their outputs, from text-to-image and image-to-text to text-to-SQL and code, and analyzed implementation strategies using vector databases like Qdrant. Each design, from single collections to retrieval orchestration, impacts scalability, performance, and accuracy. By classifying systems based on output type and aligning them with use case requirements, we gain clarity on when to adopt specialized models versus modular, retrieval-augmented architectures. This understanding forms the foundation for designing scalable, accurate, and efficient multimodal AI applications.

In the next chapter, you will learn how to design and implement a fully offline GenAI system using local LLMs. Focusing on privacy-first and cost-efficient deployments, the chapter guides you through building a RAG pipeline using tools like Ollama, ChromaDB, FAISS, and LangChain, all running locally without reliance on cloud APIs.

You will embed documents, build a retriever, and integrate an LLM for QA using Python. By the end, you will have developed a secure, customizable document-based QA bot capable of operating entirely offline, with complete control over data and compute resources.

Join our Discord space

Join our Discord workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
In this chapter, we embark on building a retrieval-augmented generation (RAG) system using local large language models (LLMs), completely offline and free from any dependency on cloud-based application programming interfaces (APIs). This approach is essential for organizations prioritizing privacy, data sovereignty, or operating under strict budget constraints. You will learn how to set up a secure and private generative AI (GenAI) pipeline suitable for enterprise or edge deployments.

We will use Ollama to run powerful open-source LLMs locally, ensuring all data remains on your machine. For storing and querying document embeddings, you will choose between Facebook AI Similarity Search (Faiss) and Chroma, both optimized for fast, efficient similarity search. The retrieval process will be managed by LangChain, a robust orchestration framework that integrates LLMs, vector stores, and custom logic. LangChain will handle everything from converting user queries into vector representations to fetching relevant documents and prompting the LLM with contextual input.

In addition to hands-on development, we will also examine the failure points of RAG systems, such as poor document chunking, embedding quality issues, and retrieval mismatches, and explore strategies to mitigate them. By the end of this chapter, you will have a fully functional, private unimodal RAG pipeline and a deeper understanding of its design, trade-offs, and limitations.

In this chapter, we will learn about the following topics:

The objective of this chapter is to guide you through building a fully offline, unimodal RAG system using local LLMs. You will learn to run LLMs with Ollama, store and search document embeddings using Faiss or ChromaDB, and manage the retrieval and generation workflow using LangChain. The focus is on creating a secure, private, and cost-effective GenAI pipeline suitable for enterprise or edge environments. Additionally, you will gain insights into common failure points in RAG systems and how to address them to ensure more accurate and reliable AI-generated responses.

Before we explore ways of developing a RAG system, it is important to understand the role graphics processing units (GPUs) play in today's GenAI applications.

GPUs play a critical role in accelerating the performance of LLMs and embedding models within a RAG system. However, whether or not you need a GPU depends on several factors, including model size, workload demands, latency requirements, and system architecture. Understanding when a GPU is necessary and when it is optional helps in building efficient and cost-effective GenAI systems, especially in offline or resource-constrained environments.

Let us look at the situations that determine whether you need a GPU:
• Proof-of-concept or low-volume use: For prototyping, academic exploration, or small-scale systems with infrequent usage, CPU execution may suffice. This lowers cost and complexity, making the system easier to deploy and maintain.

While GPUs are essential for accelerating GenAI workloads, the choice between cloud and local deployment impacts both cost and control. Using a local GPU can offer a more economical and efficient solution in many real-world scenarios.

Using a local GPU setup can be significantly more cost-friendly than relying on cloud-based GPU services, particularly in scenarios where workloads are predictable, continuous, or privacy-sensitive. Cloud GPU providers, like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, typically charge by the hour or minute, and costs can escalate quickly, especially for high-end GPUs like the A100 or H100. For teams working on long-running GenAI tasks, such as document processing, real-time RAG systems, or LLM fine-tuning, these charges accumulate rapidly. In contrast, investing in a local GPU workstation, though it may involve a higher upfront cost, can result in substantial savings over time. Once the hardware is paid for, the cost of running models locally is limited mostly to electricity and maintenance.

Moreover, local GPUs offer better control over resource utilization and scheduling, making them more efficient for continuous or iterative development. In cloud environments, users often have to wait for instance availability, deal with session timeouts, or manage additional storage and network fees. Local infrastructure eliminates these inefficiencies, allowing developers to run processes on demand without incurring extra costs. This is particularly beneficial in environments where experiments are run frequently, models are retrained, or batch inference is performed on large datasets. For example, embedding hundreds of documents or running local inference with quantized LLMs can be done without the hourly costs that cloud platforms impose.

Local GPUs support better cost control in privacy-focused or offline deployments. Many enterprises and government institutions have strict data governance policies that prohibit uploading sensitive data to external cloud services. Running GenAI systems with local LLMs on in-house GPUs not only ensures data remains on-premises but also avoids the ongoing cost of compliance-heavy cloud setups. In such cases, cloud GPU usage might require additional security layers, virtual private clouds (VPCs), and dedicated instances, further increasing costs. A local GPU setup, once in place, provides both a secure and economically sustainable platform for deploying advanced AI systems. For organizations with consistent needs and long-term GenAI goals, local GPUs represent a smart investment with a high return over time.

Running an LLM locally requires balancing hardware capacity, software tools, and model optimization techniques to achieve fast, reliable inference without relying on cloud services. The following steps outline how selecting the right model size, runtime environment, and deployment strategy can bring advanced AI capabilities entirely on-device, ensuring privacy, control, and offline availability:
1. Hardware requirements: The main factor is model size and whether you use a CPU or a GPU.

The following table compares the hardware requirements for running LLMs locally at different model sizes, showing the CPU RAM and GPU VRAM needed for smooth inference. It helps you choose the right model size based on your available computing resources (a quick way to check what your own machine offers is sketched after this list):
| Model size | CPU RAM (quantized) | GPU VRAM (full precision) | Example models |
| 3–7B | 8–16 GB | 6–8 GB | Mistral 7B, Llama 2 7B |
| 13B | 16–24 GB | 12–16 GB | Llama 2 13B |
| 30B+ | 32–64 GB | 24+ GB | Llama 2 33B, Mixtral 8x7B |
Table 3.1: Hardware requirements for running LLMs locally at different model sizes
a. CPU-only execution is possible with 4-8-bit quantization (slower but cheaper).

b. A GPU drastically improves speed (NVIDIA RTX 3060/4060 and above for 7B models).

2. Software:

• Model runtime: to load and run the LLM locally.

• llama.cpp: a lightweight C++ runner.

• Ollama: a simple local model manager.

• vLLM: high-performance GPU inference.

• Python or API environment:

o Transformers (Hugging Face): to load models locally.

o Accelerate: to optimize multi-GPU or mixed-precision execution.

o bitsandbytes: quantization support for low RAM use.

3. Model files:

a. Downloaded from Hugging Face or similar.

b. Usually .bin or .gguf files for llama.cpp, or PyTorch .pth files for transformers.

c. Quantized versions (4-bit, 8-bit) make local inference practical.

4. Deployment patterns:

a. CLI-based: run in the terminal for quick tests.

b. Local API server: expose endpoints for other apps (e.g., FastAPI, Flask).

c. Integrated in apps: call the model directly from Python or Node.js scripts.

5. Performance tips:

a. Use quantization to shrink model size and reduce memory needs.

b. Prefer Mistral, Llama, or Phi models for efficiency.

c. Reduce the context length if it is not needed (fewer tokens means faster inference).

d. Store models on an SSD for faster load times.
As we explore more efficient and private GenAI deployments, the shift toward local RAG systems becomes increasingly attractive. Tools like Ollama, Unsloth, and lightweight embedding models make it practical to build powerful RAG pipelines entirely on local hardware. We will implement the architecture shown in Figure 3.1.

The following figure represents the architecture of a RAG system, a popular framework in GenAI that combines retrieval-based methods with LLMs to provide accurate, context-aware answers:

As explained in Chapter 1, Introducing New Age Generative AI, here is how the preceding figure works:

1. Document processing: Raw documents are first ingested and then split into smaller chunks to improve search granularity and retrieval accuracy.

2. Embedding generation: These chunks are passed through an embedding model (such as OpenAI or a local alternative), which converts them into high-dimensional vector representations.

Note: Chunking is an offline activity; to keep things simple, we have shown it in the flow.

3. Vector database: The resulting embeddings are stored in a vector database (e.g., Faiss, Chroma). This database enables fast similarity searches by comparing vectors.

4. User query: When a user submits a query, it is also converted into a vector using the same embedding model.

5. Vector search: The query vector is matched against the stored document vectors to retrieve the most relevant chunks.

6. LLM processing: These retrieved chunks are sent as context to an LLM, which then generates a coherent and informed response.

7. Result delivery: The final output is returned to the user.

Ollama is a powerful yet user-friendly tool designed to simplify the process of running LLMs locally. It provides a clean interface and runtime environment for downloading, managing, and executing models such as Llama, Mistral, and others on your own machine. With just a single command, you can pull pre-configured models and start interacting with them in a secure, offline setting. Ollama handles backend optimizations, including model quantization (e.g., 4-bit and 8-bit), efficient memory usage, and hardware acceleration (GPU/CPU), making it ideal for developers, researchers, and enterprises seeking to build private GenAI applications.

One of Ollama's key advantages is its focus on privacy and simplicity. Since everything runs locally, no data leaves your machine, making it suitable for sensitive or regulated environments. It also supports integration with frameworks like LangChain, making it an excellent choice for RAG pipelines and other GenAI workflows.

Several other tools and frameworks provide similar local LLM capabilities, as detailed below:

Each of these tools caters to slightly different use cases, but all share the goal of democratizing access to powerful LLMs without relying on cloud APIs.
Let us go through a step-by-step guide to install Ollama on your local machine and run the Ollama server:

1. Check system requirements: Before installation, ensure you have:

2. Install Ollama:

a. On macOS:

brew install ollama

b. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

This will install the Ollama CLI and set up the environment.

3. On Windows (via WSL2):

curl -fsSL https://ollama.com/install.sh | sh

4. Start the Ollama server: Once installed, start the Ollama server:

ollama serve

This runs the Ollama server in the background, ready to load and run models.

5. Run a model (e.g., Llama 3 or Mistral): To download and start chatting with a model:

ollama run llama3

This will do the following:

6. Optional step: Use Ollama with LangChain or an API. Ollama exposes a local HTTP API by default at:

http://localhost:11434

You can now integrate Ollama into applications using REST or libraries like LangChain for local RAG pipelines.

If you are a Mac user, you will see Ollama:

You can also list all installed LLMs using the command:

ollama list

Figure 3.3: The image shows a terminal output listing the locally installed models in Ollama

The two models shown in the preceding figure are:

Both models were modified four weeks ago, indicating a recent setup or update. This confirms that the local environment is prepared to run these models using the Ollama CLI or API for offline LLM inference.
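As a quick, hedged way to confirm the server is reachable from Python, you can query the local API's /api/tags endpoint, which returns the models you have pulled:

import requests

# Lists locally pulled models, mirroring what "ollama list" prints.
resp = requests.get("http://localhost:11434/api/tags")
for m in resp.json().get("models", []):
    print(m["name"])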
Now that you understand Ollama, let us use it to generate a PDF document, which we will later use in our GenAI system.

Here is an end-to-end Python script that:

The prerequisites are:

ollama serve

ollama run llama3.2:3b-instruct-fp16

pip install requests reportlab

With the prerequisites in place, we can now write a Python script that ties everything together.

This script will:

The following is the complete code:
import requests
from reportlab.lib.pagesizes import LETTER
from reportlab.pdfgen import canvas
import textwrap

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2:3b-instruct-fp16"

def generate_text(topic, max_words=600):
    # Ask the local Ollama model for a structured article on the topic.
    prompt = (
        f"Write an informative article about '{topic}' with approximately {max_words} words. "
        f"Structure the article with an introduction, body, and conclusion."
    )
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False
    })
    if response.status_code == 200:
        return response.json()["response"].strip()
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")

def save_to_pdf(text, filename):
    # Render the article into a simple, paginated PDF with reportlab.
    pdf = canvas.Canvas(filename, pagesize=LETTER)
    width, height = LETTER
    margin = 50
    text_object = pdf.beginText(margin, height - margin)
    text_object.setFont("Times-Roman", 12)
    wrapped_lines = []
    for paragraph in text.split("\n"):
        wrapped_lines.extend(textwrap.wrap(paragraph, width=90))
        wrapped_lines.append("")
    for line in wrapped_lines:
        text_object.textLine(line)
        if text_object.getY() < margin:
            # Flush the current page and start a new one.
            pdf.drawText(text_object)
            pdf.showPage()
            text_object = pdf.beginText(margin, height - margin)
            text_object.setFont("Times-Roman", 12)
    pdf.drawText(text_object)
    pdf.save()

if __name__ == "__main__":
    topic = "The Role of Artificial Intelligence in Modern Education"
    try:
        print(f"Generating article on: {topic}")
        article = generate_text(topic)
        save_to_pdf(article, "ai_education_article.pdf")
        print("PDF generated successfully: ai_education_article.pdf")
    except Exception as e:
        print(str(e))
Output: The script will create a PDF file named ai_education_article.pdf:

Figure 3.4: The figure confirms successful execution of the script

It shows that an article on the topic The Role of Artificial Intelligence in Modern Education was generated using the local Ollama LLM, and that the output was saved as a PDF file named ai_education_article.pdf. This indicates that the local model ran as expected and the document was created without any errors. You are now ready to open the PDF and review the generated content.

An updated script that generates multiple topic-based PDFs is shared in this book's GitHub repository:

Figure 3.5: This figure shows the synthetic articles generated after running the script
Now that you have learned how to automatically generate documents, we will take the previously created ai_education_article.pdf and use it to build a RAG system. This system will include the following components:

Figure 3.6 shows a structured layout that exemplifies a clean and scalable RAG pipeline. Each folder represents a critical component, ranging from data ingestion (source_docs/), embedding logic (embeddings/), and vector storage (vectorstore/), to retrieval strategies (retriever/), generation logic (llm/), and LangChain-based orchestration (orchestrator/). The modularity allows for easy customization, debugging, and maintenance. Utility scripts for PDF parsing and citation tracking further enhance functionality, while memory management ensures coherent multi-turn interactions. Such a design not only supports offline deployment using local models like Mistral and Ollama but also encourages reusability and extension across varied RAG use cases. The details are as follows:

The following figure illustrates a well-organized directory structure for a modular RAG system. It includes components for document ingestion, embedding, hybrid retrieval, LLM interaction (via Ollama), memory handling, source citation, and orchestration through LangChain. Each module is clearly separated to support scalability, reusability, and clarity in development and deployment.

The following code imports all the tools from LangChain and Python that you will need:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers.hybrid import HybridRetriever
from langchain.retrievers import BM25Retriever
import os
To begin the RAG pipeline, the PDF document is loaded and segmented into manageable text chunks using LangChain's PyPDFLoader and RecursiveCharacterTextSplitter. This step is essential for breaking long documents into overlapping text blocks, preserving context while ensuring better retrieval granularity. Chunking with a specified size and overlap allows downstream embedding and retrieval systems to work efficiently without losing the narrative flow. These chunks, as explained in the code and the following list, form the foundation of vectorization and search, enabling fine-grained semantic lookup. Proper chunking ensures the system responds with accurate, relevant, and contextually coherent information during user interactions.
loader = PyPDFLoader("data/source_docs/ai_education_article.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
While RecursiveCharacterTextSplitter is the default and most flexible chunking strategy in LangChain, there are several other chunking methods you can use based on your document type, structure, or use case.

LangChain supports multiple text-splitting strategies beyond the default recursive method. Depending on the structure, language, or domain of your documents, you can choose splitters like CharacterTextSplitter, TokenTextSplitter, SentenceTransformersTextSplitter, NLTKTextSplitter, or SpacyTextSplitter. Each offers unique benefits: some preserve semantic boundaries, others optimize for LLM token limits, and a few handle structured formats like Markdown. Selecting the right splitter is crucial for maintaining content coherence and optimizing embedding quality, especially for applications like question answering, summarization, or retrieval. This modularity enables precise control over document preparation in a RAG pipeline; a short usage sketch follows the list:
from langchain.text_splitter import CharacterTextSplitter
Use case: When you need consistently sized chunks and do not mind rough breaks.

from langchain.text_splitter import TokenTextSplitter
Use case: When working with token-limited models like GPT or Mistral.

from langchain.text_splitter import SentenceTransformersTextSplitter
Use case: When you want semantically meaningful chunks (especially for QA or summarization).

from langchain.text_splitter import NLTKTextSplitter
o Uses the Natural Language Toolkit (NLTK) to split text into sentences.
Use case: Clean sentence-based chunking without manual logic.

from langchain.text_splitter import SpacyTextSplitter
Use case: When you want linguistically accurate splitting in multiple languages.

from langchain.text_splitter import MarkdownHeaderTextSplitter
Use case: For documentation, blogs, or README-style content where headers indicate topic changes.
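As a short, hedged example of swapping in TokenTextSplitter when you care about token budgets rather than characters (the sizes are illustrative, and tiktoken is assumed to be installed):

from langchain.text_splitter import TokenTextSplitter

# Same documents as before, but chunked by token count instead of characters.
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
token_chunks = token_splitter.split_documents(documents)
print(f"{len(token_chunks)} chunks within a 256-token budget")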
After chunking, each text segment is converted into high-dimensional vectors using embedding models like Mistral via OllamaEmbeddings. These embeddings numerically represent semantic meaning, allowing efficient similarity search. Chroma, a local vector database, stores these vectors along with metadata such as document source, enabling traceable retrieval. Persisting this information in the db directory allows reuse without re-embedding. Metadata enhances downstream tasks like filtering by source or time. This step, as shown in the following code and list, transforms unstructured text into structured, queryable memory, making it foundational for intelligent document retrieval in privacy-first, offline deployments:
embedding_model = OllamaEmbeddings(model="mistral")
vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory="db")

Note: If you want to reuse an existing index, replace the preceding line with:

if os.path.exists("db/index.sqlite3"):
    vectorstore = Chroma(persist_directory="db", embedding_function=embedding_model)
else:
    vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory="db")
Here are some popular embedding models you can use with OllamaEmbeddings, so you can choose the one that best fits your RAG pipeline needs:

The following is a quick example showing how to switch between different Ollama embedding models:
from langchain.embeddings import OllamaEmbeddings

# Choose one of the models below
model_name = "mxbai-embed-large"
# model_name = "nomic-embed-text"
# model_name = "all-minilm"
# model_name = "bge-m3"

embeddings = OllamaEmbeddings(model=model_name)
vectors = embeddings.embed_documents(["Sample text to embed"])
print(vectors[0][:5])  # preview of the first 5 dimensions
Hybrid retrieval combines the strengths of both keyword-based and semantic search. Best Matching 25 (BM25) handles exact keyword matches, useful for proper nouns and rare terms, while vector search retrieves contextually similar content. LangChain's HybridRetriever fuses both methods, increasing accuracy and recall by addressing both syntactic and semantic relevance. This dual approach ensures robustness across diverse query types, especially in scenarios involving ambiguous or exploratory questions. Configuring k (the number of top results) for both retrievers allows fine-tuning of search behavior, making hybrid retrieval an essential component of modern, high-performance RAG pipelines.

BM25 is a ranking function used in information retrieval to estimate the relevance of documents to a search query. It is part of the probabilistic retrieval model and improves upon earlier models by considering term frequency (how often a word appears in a document), inverse document frequency (how rare a word is across all documents), and document length normalization. BM25 assigns higher scores to documents where query terms appear frequently and are rare in the overall corpus, while adjusting for document length. It is widely used in search engines and modern retrieval systems due to its effectiveness and simplicity:
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid_retriever = HybridRetriever(vectorstore=vectorstore, bm25_retriever=bm25_retriever)
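Note that a HybridRetriever class may not be available in every LangChain version; a commonly available alternative that achieves a similar keyword-plus-semantic fusion is EnsembleRetriever, sketched here with the retrievers defined above:

from langchain.retrievers import EnsembleRetriever

# Fuses BM25 and vector scores; adjust the weights to balance
# keyword precision against semantic recall.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)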
Aside from BM25Retriever, which is great for traditional keyword-based search, LangChain supports several other retrievers that can be used depending on your RAG system's needs. Let us discuss a list of useful retrievers and what they are best suited for.

LangChain offers advanced retrieval strategies beyond basic vector and keyword search. Tools like ContextualCompressionRetriever, MultiQueryRetriever, SelfQueryRetriever, and TimeWeightedVectorStoreRetriever enable summarization, query diversification, and time-aware ranking. Others, like ParentDocumentRetriever and EnsembleRetriever, optimize for coherence and weighted strategies across retrievers. Each retriever targets a unique problem: lengthy documents, vague queries, metadata filtering, or temporal priority. By combining or swapping retrievers based on use case needs, you can greatly enhance your RAG system's relevance, flexibility, and performance, particularly in complex, evolving chat or enterprise knowledge environments. The details are as follows:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ParentDocumentRetriever
from langchain.retrievers import SelfQueryRetriever
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.retrievers import EnsembleRetriever
To maintain context across multi-turn conversations, LangChain introduces ConversationBufferMemory. This memory module stores the full chat history, enabling the language model to handle follow-ups and reference earlier queries effectively. It ensures that responses are grounded not only in the current question but also in prior interactions, improving coherence and user satisfaction. This is especially valuable in chatbots and assistants, where continuity is essential. With return_messages=True, both user and AI messages are preserved, making the RAG system capable of sustaining rich, ongoing dialogues without losing conversational state.

ConversationBufferMemory is a memory class in LangChain that stores the entire conversation history as a string buffer. It allows a language model to remember prior interactions in a chat session, helping the model maintain context across turns. The details are as follows:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
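A small, hedged illustration of what the buffer stores: save one hypothetical exchange manually and read it back under the chat_history key defined above.

# Hypothetical exchange, just to show the stored structure.
memory.save_context({"input": "What is RAG?"},
                    {"output": "Retrieval-augmented generation."})
print(memory.load_memory_variables({})["chat_history"])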
RAG 系统使用 Ollama 接口加载本地 Mistral 语言模型,并通过温度等参数控制输出行为。较低的温度值(例如0.2 )可生成确定性强、重点突出的响应,非常适合事实性问答或检索任务。由于系统在本地运行,因此可确保隐私和成本效益。该语言模型能够解读检索到的内容和用户问题,并生成结构化且信息丰富的回复。以下模块化配置允许在不同模型之间轻松切换,从而根据您的应用程序需求调整输出风格、速度和准确性,使其成为离线 GenAI 部署的基石:
The RAG system uses the Ollama interface to load the local Mistral language model, controlling output behavior via parameters like temperature. A low temperature value (e.g., 0.2) results in deterministic, focused responses ideal for factual QA or retrieval tasks. By running locally, the setup ensures privacy and cost-efficiency. This language model interprets retrieved content and user questions, generating structured and informative replies. The following modular configuration allows easy switching between models, aligning output style, speed, and accuracy with your application needs—making it a cornerstone of offline, GenAI deployments:
llm = Ollama(model="mistral", temperature=0.2)
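A quick smoke test of the model, assuming the Ollama server is running locally:
# Sketch: single prompt via .invoke(), LangChain's standard entry point.
print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))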
ReAct is a prompting strategy that guides LLMs to break down problems into logical steps before answering. The prompt template provides structure: it separates reasoning from the final response, improving transparency and traceability of answers. By explicitly instructing the model to think step-by-step, it aligns LLM output with human-like problem-solving processes. This method boosts performance in knowledge-intensive tasks like retrieval-augmented QA by encouraging the model to synthesize retrieved context meaningfully. ReAct templates enable more explainable, controllable, and trustworthy AI behaviors in RAG systems, details as follows:
react_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an intelligent assistant using the ReAct (Reasoning + Acting) technique.
Break down the user query into reasoning steps and retrieve relevant information accordingly.

Question: {question}

Relevant Context:
{context}

First, list your reasoning steps clearly.
Then, provide a final answer based on those steps and the retrieved context.

Reasoning Steps:
1.
"""
)
• Encourages logical reasoning before generating the final answer.
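To see exactly what the model receives, the template can be rendered with sample values, as in this short sketch:
# Sketch: render the ReAct template with placeholder values.
print(react_prompt.format(
    context="Chunk 1: GPUs accelerate matrix operations...",
    question="Why are GPUs used for AI workloads?"
))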
The ConversationalRetrievalChain integrates the core components (LLM, retriever, memory, and prompt) into a complete RAG workflow. It supports multi-turn dialogue by preserving history, retrieves context with hybrid search, reasons with the ReAct prompt, and responds via the Mistral model. This unified chain not only generates high-quality answers but also returns the source documents used, enhancing transparency and citation. It is the backbone of intelligent assistants and document chat systems, enabling dynamic, context-aware responses. This abstraction simplifies orchestration and encourages modular, scalable design in LLM-powered applications:
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=hybrid_retriever,
    memory=memory,
    return_source_documents=True,
    combine_docs_chain_kwargs={"prompt": react_prompt}
)
It combines the following:
• The llm (Mistral)
• The retriever (hybrid search)
• The conversation memory
• The prompt for structured reasoning
• The return of both the answer and the source documents (for citation)
The final component is the chat loop, where user input is continuously accepted, processed, and responded to by the RAG system. This loop captures user questions, passes them through the conversational QA chain, and displays both the answer and source citations. It supports real-time interaction and multi-turn memory, making it ideal for chatbots, research assistants, or document QA tools. By integrating all prior components, retrieval, generation, memory, and prompting, the chat loop brings the system to life, turning static documents into an interactive knowledge interface for end users:
print("RAG 系统已准备就绪。请就文档提出问题。")
print("RAG System Ready. Ask a question about the document.")
当 True 时:
while True:
查询 = 输入("\n用户: ")
query = input("\nUser: ")
如果 query.lower() 在 ["exit", "quit"] 中:
if query.lower() in ["exit", "quit"]:
休息
break
response = qa_chain({"question": query})
response = qa_chain({"question": query})
print("\n助理:", response["answer"])
print("\nAssistant:", response["answer"])
print("\n来源:")
print("\nSources:")
for doc in response["source_documents"]:
for doc in response["source_documents"]:
print("-", doc.metadata.get("source", "[不含源元数据的块]"))
print("-", doc.metadata.get("source", "[Chunk without source metadata]"))
In the preceding section, we touched upon a challenge called prompt size overflow. It occurs when the combined length of a prompt, including the user query, context, system instructions, and memory, exceeds the maximum token limit of the language model. Each model (like Mistral, Llama, or GPT) has a defined token capacity (e.g., 4,096 or 8,000 tokens), and exceeding this limit causes errors or truncated responses. Overflow often happens in RAG systems when too many large chunks or long conversations are included in a single prompt. To prevent it, you can limit chunk size, truncate older memory, or use token-aware text splitters and compression retrievers to keep input within safe bounds.
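For example, a token-aware splitter measures chunks in tokens rather than characters, so they cannot silently exceed the model's context budget. A minimal sketch, assuming the tiktoken package is installed:
# Sketch: token-based splitting via tiktoken (chunk sizes are in tokens).
from langchain.text_splitter import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=300,     # measured in tokens, not characters
    chunk_overlap=30,
)
chunks = token_splitter.split_text("...your document text...")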
Just as prompt size overflow can disrupt the performance of a RAG system, there are several other challenges and potential failure points to be aware of. These issues often stem from how documents are chunked, how embeddings are generated, or how retrieval strategies are configured. If not properly addressed, they can lead to irrelevant results, hallucinations, or poor answer quality. Understanding these pain points is crucial for building robust and reliable RAG pipelines. In the next section, we will explore the most common challenges encountered in RAG systems and discuss how to identify and mitigate them in practice.
The paper, Seven Failure Points When Engineering a Retrieval Augmented Generation System, emphasizes that real-world RAG systems need robust runtime validation: failures cannot be fully predicted at design time; they must be addressed as the system evolves through deployment. It offers valuable insight for practitioners building reliable systems, highlighting where checkpoints and corrective mechanisms are most needed.
Here is how you can address the seven RAG failure points from the paper in our current RAG pipeline:
vectorstore.as_retriever(search_kwargs={"k": 8})
vectorstore.as_retriever(search_kwargs={"k": 8})
from langchain.retrievers import ContextualCompressionRetriever
retriever = ContextualCompressionRetriever(base_compressor=llm, base_retriever=hybrid_retriever)
from langchain.retrievers import ContextualCompressionRetriever
retriever = ContextualCompressionRetriever(base_compressor=llm, base_retriever=hybrid_retriever)
Format your answer as a JSON object or bullet list if possible.
What you have learned so far is just scratching the surface of foundational RAG systems. You have built a working pipeline that loads documents, chunks them, generates embeddings, stores them in a vector database, and retrieves relevant context for LLM-based answering. You have also implemented ReAct-style prompting and hybrid search. However, RAG systems are complex, with deeper challenges like prompt optimization, failure detection, scalability, and evaluation. This foundational setup prepares you to explore advanced topics such as tool-augmented agents, knowledge graphs, dynamic routing, and custom retrievers, each offering more control, precision, and flexibility in real-world GenAI applications.
You can now take this RAG system as a foundation and begin exploring its flexibility. As a take-home assignment, try modifying the code to experiment with different chunking/splitting strategies, embedding models, LLMs, and retrieval or search methods. Each of these components is modular and easily swappable, allowing you to tailor the system to specific data types, performance needs, or accuracy goals. This hands-on customization will deepen your understanding of how each layer contributes to the overall performance of a GenAI pipeline.
In this chapter, we explored the essential building blocks of modern GenAI systems. We learned the role of GPUs in accelerating AI workloads and how using a local GPU can be a cost-effective, privacy-friendly alternative. We introduced Ollama as a tool to run local LLMs efficiently and walked through the architecture of RAG systems. You also learned to generate PDF documents using a local LLM, and implemented a complete RAG pipeline using LangChain, vector databases, and hybrid retrieval strategies. Finally, we examined key challenges in RAG. In the next chapter, we will implement API-based GenAI systems using OpenAI instead of Ollama.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
In this chapter, we build upon the foundation laid in the previous chapter, where we implemented a fully local retrieval-augmented generation (RAG) system using Ollama and LangChain. While that approach prioritized privacy and offline execution, this chapter shifts focus to cloud-based capabilities by integrating the OpenAI API. This enables us to scale our GenAI applications with access to powerful models like generative pretrained transformer (GPT) and beyond, allowing for enhanced reasoning, broader knowledge coverage, and more complex query handling. Our goal is to extend the RAG system to support multi-document querying.
We will explore how to design and implement a multi-document GenAI system. By combining OpenAI’s API capabilities with thoughtful system design, you will learn how to build more scalable, flexible, and intelligent GenAI pipelines suited for enterprise and cloud-native environments.
In this chapter, we will learn about the following topics:
The objective of this chapter is to guide you through building a fully API-based, unimodal RAG system using large language models (LLMs). You will learn to run LLMs with OpenAI, store and search document embeddings using Facebook AI Similarity Search (Faiss), and manage the retrieval and generation workflow using LangChain. The focus is on creating a scalable, modular GenAI pipeline suitable for an enterprise.
OpenAI is one of the leading artificial intelligence research and deployment companies in the world. It is best known for its state-of-the-art generative models such as GPT, DALL·E (text-to-image generation), Whisper (speech recognition), and Sora (text-to-video generation). These models are designed to be accessed via the OpenAI API, which allows developers to build intelligent applications across a variety of domains, including text generation, image synthesis, audio transcription, and more.
This section provides a comprehensive overview of OpenAI, its models, and the different APIs it offers. Whether you are building a RAG system, a chatbot, a summarization tool, or a multimodal application, understanding OpenAI's offerings is essential for choosing the right tools for your project.
Founded in December 2015, OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. Originally established as a non-profit, OpenAI transitioned into a capped-profit model to attract capital while remaining mission-focused.
The organization is best known for developing powerful language models that are capable of human-like understanding and generation of text. With the release of GPT-2 in 2019, followed by GPT-3 in 2020, and subsequent iterations including GPT-3.5 and GPT-4, OpenAI has consistently set the benchmark in generative AI.
The OpenAI API provides programmatic access to a range of models via RESTful endpoints. This enables developers to integrate powerful AI capabilities into their applications. The API is well-documented and accessible through official software development kits (SDKs) in languages like Python, Node.js, and others.
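For example, a minimal chat completion call with the official Python SDK (the v1-style interface) looks like the following sketch:
# Sketch: minimal chat completion with the openai Python SDK (>= 1.0).
from openai import OpenAI

client = OpenAI(api_key="your-api-key")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(completion.choices[0].message.content)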
The main functionalities covered by the OpenAI API include:
The following table summarizes the key categories of OpenAI API functionality, along with their respective endpoints and use cases, highlighting the broad capabilities available for text, image, audio, and model management operations:
| Category | Endpoint(s) | Description |
| --- | --- | --- |
| Text generation | /v1/completions, /v1/chat/completions | Generate or continue natural language text |
| Editing | /v1/edits | Modify existing text or code |
| Embeddings | /v1/embeddings | Generate vector representations for text |
| Image generation | /v1/images/generations, /v1/images/edits, /v1/images/variations | Create and edit images using DALL·E |
| Audio processing | /v1/audio/transcriptions, /v1/audio/translations | Convert speech to text and translate |
| Moderation | /v1/moderations | Detect harmful or sensitive content |
| File handling | /v1/files | Upload and manage files for fine-tuning |
| Fine-tuning | /v1/fine-tunes | Create custom models using your own data |
| Model listing | /v1/models | Retrieve available models |
Table 4.1: OpenAI API endpoint overview
These endpoints provide a robust toolkit for a variety of use cases, from building chatbots and summarizers to developing full-scale AI assistants.
OpenAI offers several families of models, each designed for specific types of tasks. Here is a breakdown of the major models available as of 2025:
Each of these models offers different levels of performance, cost-efficiency, and reasoning capabilities, allowing developers to choose based on their specific application needs.
These models allow users to generate and edit images based on textual descriptions, enabling a wide range of creative and practical applications.
To use OpenAI models, developers typically perform the following steps:
1. Create an OpenAI account and obtain an API key.
2. Choose a model appropriate for their task.
3. Call the relevant endpoint using a programming language of their choice.
4. Integrate the responses into their application logic.
Here is an example in Python to list all available models:
import openai

# Note: this snippet uses the legacy (pre-1.0) openai SDK interface.
openai.api_key = "your-api-key"

models = openai.Model.list()
for model in models.data:
    print(model.id)
This helps you dynamically fetch and utilize the models you have access to.
When building an application, the choice of model depends on multiple factors, like the following:
If you are just getting started with OpenAI models, here are some best practices to help you build effectively and avoid common pitfalls:
OpenAI offers a powerful ecosystem of models and APIs for building AI-enabled applications. With over 20 models across text, image, audio, and video modalities, the platform is robust enough to support a wide range of use cases. Whether you are building a cloud-based RAG system, a multimodal assistant, or an enterprise-level GenAI platform, understanding OpenAI's offerings is the first step toward creating impactful AI solutions.
By mastering the OpenAI API, developers unlock the ability to create intelligent, scalable, and future-ready applications that leverage some of the most advanced AI capabilities available today.
As OpenAI's models have matured, their capabilities have expanded beyond generating text to performing multi-step reasoning and tool-based task execution. Initially known for models like GPT-3 and GPT-4, which excel at language understanding and generation, OpenAI has evolved its ecosystem to support more autonomous and interactive systems, paving the way for agentic AI.
Agentic AI represents a significant shift from passive text generation to active decision-making, tool use, and autonomous workflows. With the introduction of the Responses API and the Agents SDK, developers can now build intelligent agents capable of reasoning over tasks, invoking tools like web search or file retrieval, and orchestrating complex interactions with minimal human intervention.
This transition reflects OpenAI’s broader mission to create systems that are not only intelligent but also useful, adaptive, and context-aware. Through frameworks like Operator (for browser tasks) and Codex (for software development), OpenAI enables agents that can act in the real world, not just simulate conversation.
The following section explores some of the details.
OpenAI has introduced a powerful set of tools and APIs for building agent-based systems, collectively referred to as the agentic ecosystem. These interfaces are designed to support more complex and autonomous workflows where models can reason, invoke tools, perform tasks, and interact with digital environments in a structured manner. This section provides an overview of the core components of OpenAI’s agentic infrastructure, including the Responses API, the Agents SDK, Operator, and domain-specific agents like Codex.
The Responses API, launched in early 2025, serves as OpenAI’s primary interface for building agentic applications. It extends the capabilities of the standard Chat Completions API by enabling a single API call to include not only textual reasoning but also tool invocation and stateful context management. Through the Responses API, developers can orchestrate interactions where the model performs tasks such as file lookups, web searches, or tool-based computations in a coherent sequence.
This API supports integrated reasoning and action loops, making it particularly useful for applications that require dynamic workflows. It is designed to eventually replace assistant APIs, providing a more streamlined and scalable foundation for agentic behavior.
To support the development of complex workflows and multi-agent systems, OpenAI provides an official Agents SDK. Available in both Python and JavaScript/TypeScript, this SDK offers primitives such as agents, tools, workflows, guardrails, and handoffs. Developers can use the SDK to define agent logic, manage tool interactions, and coordinate actions across multiple AI agents.
The SDK facilitates features such as:
For example, using the Python SDK, an agent can be instantiated and executed with minimal setup:
from agents import Agent, Runner

agent = Agent(name="Assistant", instructions="You are a helpful assistant")
result = Runner.run_sync(agent, "Write a haiku about recursion")
print(result.final_output)
This abstraction allows developers to focus on the business logic while the SDK handles orchestration.
Operator is an autonomous agent developed by OpenAI for executing web-based tasks. Introduced in 2025, Operator allows AI systems to perform actions such as navigating websites, filling forms, and interacting with graphical user interfaces. It builds on the Responses API to bridge reasoning with real-world action, making it possible for agents to complete workflows that traditionally required human intervention.
This capability is particularly useful for use cases such as order placement, automated customer support, and form-driven workflows, where the agent needs to operate a browser-based interface.
Codex is OpenAI’s agentic AI system designed specifically for software development. Released in May 2025, Codex is capable of generating, debugging, and executing code. It extends beyond simple code generation by enabling agents to run tests, make edits, and interact with existing software systems to fulfill user-defined programming tasks.
Codex is accessible via OpenAI’s developer platform and as part of higher-tier subscription plans. It integrates seamlessly with the Responses API and Agents SDK, supporting use cases in software engineering, automation, and DevOps.
Prior to the Responses API, OpenAI provided the Assistants API (Legacy API) to facilitate tool-augmented conversations within a structured thread-based interface. While still available, this API is being phased out in favor of the more flexible and powerful Responses API. Developers are encouraged to transition to the new agentic stack, as future development and support will center around the Responses API and the Agents SDK.
In earlier chapters, we focused on building systems that query a single document at a time, which is effective for narrow, well-defined tasks. However, real-world applications often require reasoning across multiple documents to gather context, compare information, or synthesize insights. Transitioning to a multi-document query approach allows our system to handle broader and more complex user intents. This shift involves rethinking how we chunk, embed, and retrieve information, ensuring relevance and coherence across diverse sources. In the following sections, we will explore strategies to support multi-document querying and how to integrate them into a scalable RAG pipeline.
We will use the following code to generate multiple documents:
import requests
from reportlab.lib.pagesizes import LETTER
from reportlab.pdfgen import canvas
import textwrap
import os

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.2:3b-instruct-fp16"

def generate_text(topic, max_words=600):
    prompt = (
        f"Write an informative article about '{topic}' with approximately {max_words} words. "
        f"Structure the article with an introduction, body, and conclusion."
    )
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False
    })
    if response.status_code == 200:
        return response.json()["response"].strip()
    else:
        raise Exception(f"Error: {response.status_code} - {response.text}")

def save_to_pdf(text, filename):
    pdf = canvas.Canvas(filename, pagesize=LETTER)
    width, height = LETTER
    margin = 50
    text_object = pdf.beginText(margin, height - margin)
    text_object.setFont("Times-Roman", 12)
    wrapped_lines = []
    for paragraph in text.split("\n"):
        wrapped_lines.extend(textwrap.wrap(paragraph, width=90))
        wrapped_lines.append("")
    for line in wrapped_lines:
        text_object.textLine(line)
        # Start a new page when the cursor reaches the bottom margin.
        if text_object.getY() < margin:
            pdf.drawText(text_object)
            pdf.showPage()
            text_object = pdf.beginText(margin, height - margin)
            text_object.setFont("Times-Roman", 12)
    pdf.drawText(text_object)
    pdf.save()

if __name__ == "__main__":
    topics = [
        "The Future of Renewable Energy",
        "Benefits and Risks of Artificial General Intelligence",
        "How Blockchain is Transforming Financial Services",
        "The Importance of Mental Health Awareness",
        "Climate Change and Its Impact on Global Agriculture"
    ]
    os.makedirs("generated_articles", exist_ok=True)
    for topic in topics:
        try:
            print(f"Generating article on: {topic}")
            article = generate_text(topic)
            safe_title = topic.lower().replace(" ", "_").replace(",", "").replace(".", "")
            filename = f"generated_articles/{safe_title}.pdf"
            save_to_pdf(article, filename)
            print(f"PDF generated successfully: {filename}")
        except Exception as e:
            print(f"Failed to generate article for topic '{topic}': {str(e)}")
This section provides a detailed walkthrough of a modular RAG system. It combines OpenAI's GPT-4o for QA with Faiss as the vector database for efficient document retrieval. Each component is encapsulated in a separate module for clarity and maintainability.
The following figure illustrates the architecture of a metadata-aware multi-document RAG system using OpenAI's models for both embeddings and answer generation. It highlights how queries are processed through a hybrid retrieval mechanism that combines vector similarity with metadata filtering to ensure accurate, source-specific responses:
Figure 4.1: Metadata-filtered hybrid RAG architecture using OpenAI
This script is the user-facing interface of the RAG system. It loads the RAG chain and enters a continuous loop to accept user questions.
Upon receiving input, it invokes the RAG pipeline to retrieve relevant document chunks and generate a natural language response.
It prints the final answer as well as references to source documents used in the answer generation. This script ensures seamless interaction between the user and the system:
#main.py
from orchestrator.rag_chain import get_rag_chain

print("RAG System Ready. Type 'exit' to quit.")
invoke_rag_chain = get_rag_chain()

while True:
    query = input("\nUser: ")
    if query.lower() in ['exit', 'quit']:
        break
    result = invoke_rag_chain(query)
    print("\nAssistant:", result["answer"])
    print("\nSources:")
    for doc in result.get("source_documents", []):
        print("-", doc.metadata.get("source", "[unknown]"))
This module centralizes key system constants such as model names, embedding identifiers, API keys, and file paths. It defines the location of source PDFs and the vector database, making it easy to adjust the system setup without editing core logic. By consolidating these values in a single location, the script ensures consistency and facilitates easier debugging and environment portability. It plays a foundational role in making the pipeline easily configurable and modular.
#config.py
MODEL_NAME = "gpt-4o"
EMBEDDING_MODEL = "text-embedding-3-small"
OPENAI_API_KEY = "your-api-key"
VECTOR_DB_PATH = "db"
SOURCE_DOCS = [
    "data/source_docs/ai_education_article.pdf",
    "data/source_docs/how_blockchain_is_transforming_financial_services.pdf"
]
This script initializes the OpenAI embedding model specified in the configuration. It serves as an abstraction layer to convert raw document text into dense numerical vector embeddings. These vectors are later used by the retrieval engine to identify relevant chunks semantically close to user queries. The module ensures that embedding generation is reusable, encapsulated, and easy to swap if the backend model changes.
#embedder.py
from langchain_openai import OpenAIEmbeddings
from config import EMBEDDING_MODEL, OPENAI_API_KEY

def get_embedding_model():
    return OpenAIEmbeddings(
        model=EMBEDDING_MODEL,
        api_key=OPENAI_API_KEY
    )
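A quick sanity check is to embed a single string and inspect the vector's dimensionality (1,536 for text-embedding-3-small); a minimal sketch:
# Sketch: embed one query string and confirm the vector length.
emb = get_embedding_model()
vector = emb.embed_query("What is hybrid retrieval?")
print(len(vector))  # 1536 for text-embedding-3-small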
This module is responsible for managing the Faiss vector database. It first checks if a vector index already exists locally and loads it if available, avoiding redundant computation.
If the index does not exist, it generates vector embeddings from the document chunks and creates a new Faiss index.
This setup supports persistence and fast retrieval for downstream search components in the RAG pipeline.
#db_handler.py
import os
from pathlib import Path
from langchain_community.vectorstores import FAISS
from embeddings.embedder import get_embedding_model
from config import VECTOR_DB_PATH

def get_vectorstore(documents):
    embedding_model = get_embedding_model()
    index_file = Path(VECTOR_DB_PATH) / "index.faiss"
    store_file = Path(VECTOR_DB_PATH) / "index.pkl"
    if index_file.exists() and store_file.exists():
        return FAISS.load_local(
            VECTOR_DB_PATH,
            embedding_model,
            allow_dangerous_deserialization=True
        )
    vectorstore = FAISS.from_documents(
        documents,
        embedding=embedding_model
    )
    vectorstore.save_local(VECTOR_DB_PATH)
    return vectorstore
This utility script enriches each document chunk with metadata that tracks its source filename. This metadata is later used to provide attribution and filtering during retrieval and response generation. By tagging the origin of each text chunk, the system can ensure transparency, traceability, and explainability in RAG responses.
This metadata also supports topic-specific filtering and improves user trust by surfacing source information.
#metadata_schema.py
def add_metadata_to_chunks(chunks, source_name):
    for chunk in chunks:
        if not chunk.metadata:
            chunk.metadata = {}
        chunk.metadata["source"] = source_name
    return chunks
This module handles the ingestion and preprocessing of source PDF documents. It loads each file, extracts the raw text, and then splits the content into overlapping, semantically relevant chunks. Each chunk is further enriched with metadata such as the source filename, enabling better traceability during retrieval. This modular approach prepares the documents for embedding and retrieval while supporting flexible document management.
#pdf_parser.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from config import SOURCE_DOCS
from vectorstore.metadata_schema import add_metadata_to_chunks
import os

def load_and_chunk_pdfs():
    all_chunks = []
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    for path in SOURCE_DOCS:
        loader = PyPDFLoader(path)
        documents = loader.load()
        chunks = splitter.split_documents(documents)
        source_name = os.path.basename(path)
        enriched_chunks = add_metadata_to_chunks(chunks, source_name)
        all_chunks.extend(enriched_chunks)
    return all_chunks
This module filters chunks based on topic relevance and combines BM25 and vector retrieval for improved accuracy. It adds a filtering step based on keyword-topic mapping to dynamically restrict chunks by topic before the BM25 and vector retrievers are created.
This is done by modifying the retriever logic to filter documents by metadata before scoring and combining them. The following is how you achieve it modularly:
#hybrid_search.py
from langchain.retrievers import BM25Retriever, EnsembleRetriever

def filter_chunks_by_topic(chunks, topic):
    if not topic:  # no topic supplied; keep all chunks
        return chunks
    topic = topic.lower()
    if "blockchain" in topic or "crypto" in topic:
        return [c for c in chunks if "blockchain" in c.metadata.get("source", "").lower()]
    elif "education" in topic or "ai" in topic or "artificial intelligence" in topic:
        return [c for c in chunks if "education" in c.metadata.get("source", "").lower()]
    else:
        return chunks

def get_hybrid_retriever(chunks, vectorstore, topic=None):
    filtered_chunks = filter_chunks_by_topic(chunks, topic)
    bm25_retriever = BM25Retriever.from_documents(filtered_chunks)
    bm25_retriever.k = 4
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    return EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.5, 0.5]
    )
This module initializes the core LLM used for response generation, such as OpenAI's GPT-4o, based on configuration settings. It wraps the model inside LangChain’s abstraction to allow easy integration with retrieval chains and memory components. The model is configured with a low temperature to favor deterministic, informative answers. This component serves as the generative backbone of the RAG system, producing human-like responses from the retrieved content.
#generate.py
from langchain_openai import ChatOpenAI
from config import MODEL_NAME, OPENAI_API_KEY

def get_llm():
    return ChatOpenAI(
        model=MODEL_NAME,
        temperature=0.2,
        api_key=OPENAI_API_KEY
    )
This module defines the structured prompt that instructs the LLM to follow the reasoning and acting (ReAct) paradigm. It encourages the model to first list intermediate reasoning steps before providing a final answer. This improves interpretability, reduces hallucination, and ensures the model aligns its reasoning with the retrieved context. It enables a more transparent and auditable answer generation process in complex query scenarios.
#react_prompt.py
from langchain.prompts import PromptTemplate

react_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an intelligent assistant using the ReAct (Reasoning + Acting) technique.
Break down the user query into reasoning steps and retrieve relevant information accordingly.

Question: {question}

Relevant Context:
{context}

First, list your reasoning steps clearly.
Then, provide a final answer based on those steps and the retrieved context.

Reasoning Steps:
1.
"""
)
This is the orchestration layer that wires together all components in the RAG system. It loads and preprocesses documents, builds or loads the vector store, configures the LLM and retriever, and binds them into a unified pipeline. It dynamically builds a hybrid retriever based on the user’s query to enhance retrieval relevance. The function returns a callable interface that processes user questions end-to-end, generating high-quality answers with source traceability.
#rag_chain.py
from utils.pdf_parser import load_and_chunk_pdfs
from vectorstore.db_handler import get_vectorstore
from retriever.hybrid_search import get_hybrid_retriever
from llm.generate import get_llm
from memory.conversation_buffer import memory
from llm.react_prompt import react_prompt
from langchain.chains import ConversationalRetrievalChain

def get_rag_chain():
    chunks = load_and_chunk_pdfs()
    vectorstore = get_vectorstore(chunks)
    llm = get_llm()

    def invoke_rag_chain(query: str):
        hybrid_retriever = get_hybrid_retriever(chunks, vectorstore, topic=query)
        rag = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=hybrid_retriever,
            memory=memory,
            return_source_documents=True,
            combine_docs_chain_kwargs={"prompt": react_prompt},
            output_key="answer"
        )
        return rag.invoke({"question": query})

    return invoke_rag_chain
This module configures a memory buffer to retain past user queries and assistant responses across multiple turns. It enables the system to carry forward conversational context, making follow-up questions more coherent and contextually aware. By storing interaction history, it transforms the assistant into a truly interactive and conversational agent. This is critical for maintaining continuity in user sessions and improving the overall user experience.
#conversation_buffer.py
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
This file lists all required Python packages needed to install and run the RAG system. It includes LangChain modules, OpenAI SDKs, vector database libraries like Faiss, and PDF processing tools. The file allows easy environment bootstrapping using pip install -r requirements.txt. Maintaining this file ensures reproducibility, portability, and collaboration across teams or deployments.
#requirements.txt
langchain
langchain-community
langchain-openai
faiss-cpu
reportlab
rank_bm25
pypdf
openai
Use the following command to install them:
pip install -r requirements.txt
This modular architecture promotes scalability, maintainability, and reusability. Each component has a single responsibility, making it easier to swap out models, change the retrieval mechanism, or update the document pipeline as needed.
You can now confidently run your RAG app.
In our current implementation, the RAG system is single-tenant, meaning it handles all data and user interactions within a single shared environment. All source documents are embedded into a single vector store, and the retrieval process operates across the same shared document index regardless of which user submits a query.
In contrast, a multi-tenant RAG system must enforce data isolation between tenants, organizations, departments, or individual users. Each tenant would have its own isolated vector store or a namespace within a shared store, ensuring that one user’s data and results are never exposed to another. The system must dynamically load the correct vector store and memory context based on the tenant identity during each query.
Using the existing code, available in the GitHub repo of this book, consider what specific architectural changes would be required to transform this single-tenant RAG system into a secure, scalable multi-tenant one. Focus on vector store separation, memory handling, and request-level routing. Optionally, mention how user authentication or metadata tagging could support these changes.
This chapter offered a concise yet thorough guide to building advanced RAG systems using OpenAI technologies. We began with core concepts and APIs that enable seamless integration of language models into real-world use cases. We then explored the evolution toward agentic AI, autonomous systems capable of reasoning and executing tasks, which marks a shift from static interactions to dynamic, adaptive workflows.
A key focus was multi-document querying, essential for aggregating context from diverse sources. We presented a modular, scalable RAG architecture that combines OpenAI models with Faiss for hybrid retrieval, enabling high relevance, flexibility, and enterprise-grade performance.
In the next chapter, we will understand agentic GenAI with human-AI interaction. It will guide readers through building decision-aware agents that retrieve, reason, act, and interact. This includes integrating tool use, feedback loops, and multi-agent collaboration, extending RAG into dynamic, interactive systems with human oversight.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
As generative AI (GenAI) continues to evolve beyond simple query-response paradigms, agentic GenAI emerges as a powerful architectural approach that enables structured, dynamic, and autonomous reasoning. Unlike traditional models that respond in a single step, agentic GenAI systems are designed to plan, retrieve information, utilize tools, and make decisions across multiple reasoning steps. This chapter introduces readers to the foundational concepts of building such systems, focusing on modular, extensible, and multi-agent architectures. Drawing on real-world patterns, from sequential agents to hierarchical planners, this chapter provides a comprehensive guide to engineering agents that think and act like orchestrators.
You will learn how to use tools like LangChain’s ReAct framework, LangGraph, and retrieval components to implement intelligent multi-agent systems. These agents can interact with APIs, query vector databases, utilize memory, and even collaborate with human-in-the-loop (HITL). Visual frameworks such as aggregator, loop, and router patterns will be mapped to code using Python, giving you practical insight into how these abstract ideas are realized. By mastering these agentic patterns and design principles, you will gain the ability to develop AI systems that do not just generate, but reason, retrieve, and respond with purpose.
In this chapter, we will learn about the following topics:
The objective of this chapter is to equip readers with a deep understanding of agentic GenAI systems by exploring their architecture, design patterns, and practical implementations. Readers will learn how to build multi-agent workflows that enable reasoning, tool use, memory integration, and collaboration. The chapter also introduces HITL retrieval-augmented generation (RAG) systems and contrasts traditional AI agents with agentic AI, emphasizing orchestration and adaptive planning. By the end, readers will be able to design intelligent systems that move beyond single-step responses, laying the foundation for scalable, autonomous AI applications in dynamic, real-world environments.
In Chapter 2, Deep Dive into Multimodal Systems, we introduced the concept of multi-agent systems, systems where autonomous AI agents collaborate to solve complex tasks. In this section, we explore the design patterns that form the backbone of such systems. These patterns are essential for building intelligent, modular, and scalable GenAI applications. By understanding and applying them, developers can move beyond one-shot generation models and architect truly dynamic, agentic systems capable of planning, reasoning, retrieving, acting, and learning.
Multi-agent systems represent a significant shift from monolithic AI systems to distributed, interactive architectures. Each agent in these systems can be specialized, autonomous, or interdependent, contributing to sophisticated workflows through shared memory, tools, and reasoning paths. In practice, these systems are built by combining reusable design patterns that define how agents interact with one another and the environment. In the following sections, we examine both classical and advanced patterns, ranging from simple sequential flows to collaborative, fault-tolerant, and multimodal reasoning systems.
The parallel pattern structures multiple AI agents to operate concurrently on either the same input or different components of a larger input. Each agent performs its task independently, without being influenced by others, and the final result is obtained by merging or aggregating their individual outputs.
Structure and behavior: All agents are triggered simultaneously. They may use the same input (e.g., a shared user prompt) or segmented parts (e.g., split documents). After processing, a merge function aggregates results into a unified output.
Design rationale: This pattern is particularly effective when tasks can be decomposed into independent units of work. It maximizes speed through parallelism and can exploit specialization among agents.
Practical application:
The following figure illustrates a parallel agentic orchestration workflow where a central LLM-based orchestrator distributes input tasks to specialized agents for parallel processing before generating the final output:
Figure 5.1: LLM orchestrator routes input to agents for collaborative output
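To make the mechanics concrete, here is a minimal, framework-free sketch of the parallel pattern; the async functions are stand-ins for real agent or LLM calls:
# Sketch: two stand-in "agents" run concurrently on the same input,
# and a merge step aggregates their outputs.
import asyncio

async def summarizer_agent(text: str) -> str:
    return "summary: " + text[:40]

async def keyword_agent(text: str) -> str:
    return "keywords: " + ", ".join(text.split()[:5])

async def run_parallel(text: str) -> str:
    outputs = await asyncio.gather(summarizer_agent(text), keyword_agent(text))
    return "\n".join(outputs)  # merge/aggregate step

print(asyncio.run(run_parallel("Renewable energy is reshaping global power grids.")))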
The sequential pattern connects agents in a pipeline, where each agent's output becomes the next agent's input. This creates a multi-step reasoning or transformation process.
Structure and behavior: Agent A → Agent B → Agent C, forming a clear, ordered chain of execution. Each step builds on the last, often increasing abstraction or refining output.
Design rationale: Useful for workflows requiring layered processing. Each agent can perform a simple task, resulting in manageable, testable, and interpretable steps.
Practical application:
This figure shows a sequential agent collaboration pattern where an LLM orchestrator routes input through one agent, which delegates part of the task to another agent before producing the final output:
Figure 5.2: Chained agent collaboration with LLM orchestration
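A minimal sketch of the same idea in Python, with three placeholder agents standing in for real LLM calls:
def extract(text):
    # Agent A: pull raw facts from the input.
    return f"facts({text})"

def analyze(facts):
    # Agent B: reason over the extracted facts.
    return f"analysis({facts})"

def report(analysis):
    # Agent C: produce the final deliverable.
    return f"report({analysis})"

def sequential_pattern(text):
    # Each agent's output becomes the next agent's input.
    output = text
    for agent in (extract, analyze, report):
        output = agent(output)
    return output

print(sequential_pattern("server logs from 2024-06-01"))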
In the loop pattern, agents iteratively process an input through a feedback mechanism. The system re-evaluates or refines results in each loop cycle, continuing until a convergence condition is met.
Structure and behavior: Agent A produces output → Agent B evaluates → feedback is returned → repeat until quality threshold is met or loop count ends.
Design rationale: Ideal for tasks involving iterative improvement, optimization, or learning from feedback. Encourages refinement over one-shot generation.
Practical application:
This figure represents a looped agent interaction, where agents collaboratively refine results through iterative communication before producing the final output, all under the direction of an LLM orchestrator:
Figure 5.3: LLM-guided agent loop for iterative task solving
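The loop pattern reduces to a generate-evaluate cycle. In the following sketch, producer and evaluator are trivial stand-ins for an LLM generator and critic:
def producer(question, feedback=None):
    # Stand-in generator; a real system would call an LLM here.
    return f"draft('{question}', feedback={feedback})"

def evaluator(answer):
    # Stand-in critic returning a quality score and revision feedback.
    score = 0.9 if "feedback=add citations" in answer else 0.5
    return score, "add citations"

def loop_pattern(question, threshold=0.8, max_cycles=3):
    feedback, answer = None, ""
    for _ in range(max_cycles):
        answer = producer(question, feedback)
        score, feedback = evaluator(answer)
        if score >= threshold:  # convergence condition met
            break
    return answer

print(loop_pattern("Summarize the report"))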
This pattern introduces a central router agent that dynamically decides which downstream agent should handle an incoming task based on content, context, or metadata.
Structure and behavior: Router receives input → classifies or analyzes → sends to one of many specialized agents → result is returned.
Design rationale: Supports modularity and conditional logic. By separating decision-making from task execution, it promotes reusability and system flexibility.
Practical application:
The following architecture introduces a router agent between the LLM orchestrator and downstream agents, enabling smart task delegation based on input characteristics:
Figure 5.4: Router agent directs tasks to specialized agents for output generation
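A rough sketch of a router, assuming a simple keyword classifier in place of an LLM-based one:
def classify(task):
    # Lightweight routing logic; a production router might use an LLM
    # classifier or embedding similarity instead of keyword rules.
    if "translate" in task:
        return "translator"
    if "calculate" in task:
        return "calculator"
    return "general"

AGENTS = {
    "translator": lambda t: f"translated({t})",
    "calculator": lambda t: f"computed({t})",
    "general": lambda t: f"answered({t})",
}

def router_pattern(task):
    # The router decouples decision-making from task execution.
    return AGENTS[classify(task)](task)

print(router_pattern("translate this contract to French"))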
The aggregator pattern combines inputs or outputs from multiple sources into a coherent result. It focuses not on parallel execution but on synthesis and consolidation.
Structure and behavior: Multiple inputs → aggregator agent → normalizes, merges, or summarizes data → returns single output.
Design rationale: Useful when diverse perspectives or data sources are required for a comprehensive output. Promotes robustness through redundancy.
Practical application:
The following figure depicts an aggregator pattern, where multiple inputs are unified by an LLM-based orchestrator before being passed to an agent for final processing:
Figure 5.5: Orchestrator aggregates multiple inputs for unified agent execution
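As a minimal illustration, an aggregator agent can be reduced to a normalize-and-merge function; the source names and texts here are hypothetical:
def aggregator_pattern(sources):
    # Normalize each source, then merge into one consolidated output.
    normalized = {name: text.strip() for name, text in sources.items()}
    merged = "; ".join(f"{name}: {text}" for name, text in normalized.items())
    return f"consolidated -> {merged}"

print(aggregator_pattern({"news": " Stock up 4% ", "filings": "Revenue grew 8%"}))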
In this pattern, agents are fully or partially connected and communicate freely without centralized control. It reflects a decentralized, mesh-like topology.
Structure and behavior: Agents exchange messages with any peer in the network, forming an open, collaborative environment. Coordination is emergent.
Design rationale: Best for complex environments requiring autonomy, adaptability, and peer learning. Suited for distributed problem-solving.
Practical application:
The following figure illustrates a network pattern where an LLM orchestrator activates a collaborative mesh or network of agents, each contributing to and building upon one another's outputs:
Figure 5.6: Networked agents collaborate for enriched output generation
The hierarchical pattern organizes agents into layers of abstraction. High-level agents (planners or supervisors) delegate tasks to mid or low-level agents, who execute them.
Structure and behavior: Planner agent → task delegation → worker agents → aggregated output returned up the hierarchy.
Design rationale: Encourages clarity of responsibility and control. Supports task decomposition and team-like collaboration.
Practical application:
This architecture represents a hierarchical pattern where an orchestrator routes input through a coordinating agent, which then delegates tasks to specialized agents for parallel outputs:
Figure 5.7: Hierarchical agent chain for distributed output generation
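A compact sketch of the planner-worker split, with placeholder functions standing in for LLM-backed agents:
def planner(goal):
    # High-level agent decomposes the goal into subtasks.
    return [f"research {goal}", f"draft {goal}", f"review {goal}"]

def worker(subtask):
    # Low-level agent executes a single delegated subtask.
    return f"done({subtask})"

def hierarchical_pattern(goal):
    results = [worker(s) for s in planner(goal)]
    return {"goal": goal, "results": results}  # aggregated back up the hierarchy

print(hierarchical_pattern("market entry report"))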
This pattern introduces human decision-making into the system at critical junctures, allowing agents to pause and await user input or validation before continuing.
Structure and behavior: Agent execution halts → human reviews or provides input → execution resumes.
Design rationale: Essential in sensitive domains (legal, healthcare, ethics) where human oversight is required for safety, correctness, or regulation.
Practical application:
This figure highlights a HITL agent framework, where the LLM-orchestrated agents generate multiple outputs and the human provides oversight and final judgment:
Figure 5.8: HITL agent framework with human oversight and final judgment
Multiple agents access a common toolkit, such as APIs, search engines, or vector databases, to maintain consistency and efficiency across tasks.
Structure and behavior: Agents → shared interface layer → tool/database/API.
Design rationale: Promotes modularity and reduces duplication. Allows centralized updates and monitoring.
Practical application:
This architecture demonstrates an advanced HITL system where agents not only collaborate but also use shared tools to refine their outputs before human review.
Figure 5.9: Shared tool-augmented agents collaborate under human supervision for optimized output
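A minimal sketch of a shared interface layer, assuming plain callables in place of real API or database clients:
TOOLS = {
    # One shared interface layer; updating a tool here updates it for
    # every agent, and usage can be monitored in a single place.
    "search": lambda q: f"search_results({q})",
    "vector_db": lambda q: f"similar_chunks({q})",
}

def research_agent(query, tools=TOOLS):
    return tools["search"](query)

def qa_agent(query, tools=TOOLS):
    return tools["vector_db"](query)

print(research_agent("GenAI design patterns"), qa_agent("GenAI design patterns"))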
This pattern surrounds agents with tools and databases that provide, enrich, or persist knowledge in real-time, aiding intelligent decision-making.
Structure and behavior: Agent processes → external tool provides transformation → data stored or fed forward.
Design rationale: Combines computation with structured persistence. Supports complex workflows involving state tracking and enrichment.
Practical application:
This architecture showcases agents enhanced with access to external tools like vector databases and web search engines, all coordinated through an LLM orchestrator and guided by human feedback:
Figure 5.10: Agents use vector and web search tools to generate enriched, human-verified output
In this pattern, agents update memory based on processed insights from tools, enabling learning and personalization across sessions.
Structure and behavior: Agent or tool extracts signal → memory module is updated → future decisions influenced by memory state.
Design rationale: Supports adaptive systems that learn from history, preferences, and interactions over time.
Practical application:
This figure illustrates an AI orchestration framework where a user's input is managed by an orchestrator, which uses language models and specialized agents to conduct tasks like vector search, web search, and memory retrieval, facilitating iterative, HITL outputs:
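The memory update cycle can be sketched as follows; the in-process dictionary is a stand-in for a persistent store such as a vector database:
MEMORY = {}  # session store; production systems might persist this externally

def remember(user, key, value):
    MEMORY.setdefault(user, {})[key] = value

def memory_aware_agent(user, question):
    prefs = MEMORY.get(user, {})
    answer = f"answer({question}, style={prefs.get('style', 'default')})"
    remember(user, "last_question", question)  # state influences future turns
    return answer

remember("alice", "style", "concise")
print(memory_aware_agent("alice", "What changed since last week?"))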
This pattern divides the system into a planning agent that determines the strategy and one or more executors that carry out actions based on the plan.
Structure and behavior: Planner reasons over goal → forms a plan → executors act step-by-step → feedback returned to planner.
Design rationale: Mimics human cognition (thinking before acting). Enables complex, multi-step reasoning and traceable execution.
Practical application:
The following figure demonstrates a multi-agent AI workflow, showing how user input is routed via an orchestrator and planner agent to specialized agents for tasks such as vector search and web search, with outputs feeding into memory for iterative improvement:
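A minimal sketch of the planner-executor split, with a static plan standing in for an LLM-generated strategy:
def plan(goal):
    # The planner emits an ordered strategy before any action is taken.
    return ["vector_search", "web_search", "synthesize"]

EXECUTORS = {
    "vector_search": lambda g: f"vector_hits({g})",
    "web_search": lambda g: f"web_hits({g})",
    "synthesize": lambda g: f"final_answer({g})",
}

def planner_executor(goal):
    trace = [EXECUTORS[step](goal) for step in plan(goal)]
    # In a full implementation, each result would be fed back to the planner.
    return trace[-1], trace

print(planner_executor("latest LLM benchmarks"))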
This pattern includes a validator or critic agent that reviews and either approves or requests revisions of another agent's output.
Structure and behavior: Producer agent → output reviewed by validator → approved or revised → final output.
Design rationale: Improves reliability, reduces hallucinations, and provides quality control. Acts as an internal feedback loop.
Practical application:
The following figure illustrates an agent-based AI workflow featuring an orchestrator that processes input and delegates tasks to agents through a critic agent, ensuring quality via feedback loops and agent collaboration:
Figure 5.13: Agent-based AI workflow with critic-mediated task execution and iterative agent feedback
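The produce-review cycle can be sketched as follows; the critic here is a trivial stand-in for an LLM-based reviewer applying a rubric:
def produce(task):
    return f"draft({task})"

def critic(output):
    # Returns (approved, revision_note); real critics use an LLM rubric.
    return ("revised" in output, "expand and cite sources")

def validator_pattern(task, max_revisions=2):
    output = produce(task)
    for _ in range(max_revisions):
        approved, note = critic(output)
        if approved:
            break
        output = f"revised({output}, note={note})"
    return output

print(validator_pattern("quarterly summary"))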
Agents with differing goals or perspectives communicate iteratively to reach a decision or resolution. This simulates negotiation, compromise, or game-theoretic behavior.
Structure and behavior: Agents exchange offers or proposals → state evolves based on preferences → agreement or failure.
Design rationale: Models real-world stakeholder interaction. Useful in simulations or distributed decision systems.
Practical application:
This figure depicts a negotiation workflow among AI agents. The negotiator agent issues two signals to each agent: the top agent declines the first signal but accepts the second, while the bottom agent declines both signals. This selective signaling ultimately results in only the top agent contributing to the final output:
This pattern uses multiple agents to process different types of input or output (e.g., text, image, audio), and combines their insights into a unified result.
Structure and behavior: Input routed based on modality → modality-specific processing → fusion agent combines results.
Design rationale: Enables multi-sensory AI systems that can reason across formats and deliver richer insights.
Practical application:
The following figure visualizes a coordinated AI system where an orchestrator leverages language models to route user-provided text or images to specialized agents, whose processed outputs are merged for a unified result:
Multiple agents offer answers, and a final result is chosen based on consensus, confidence, or voting algorithms.
Structure and behavior: Agents process in parallel → submit predictions or evaluations → aggregator computes best result.
Design rationale: Boosts reliability and robustness. Reduces bias from a single source of truth.
Practical application:
The following figure illustrates a multi-agent AI decision-making framework where agents vote on proposed solutions, with the orchestrator selecting the consensus-approved output to ensure high-quality results:
Figure 5.16: Multi-agent AI voting system streamlines decision-making for optimal output
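A minimal sketch of consensus by majority vote over agent answers:
from collections import Counter

def voting_pattern(answers):
    # Majority vote across agent outputs; ties resolve to the first answer seen.
    return Counter(answers).most_common(1)[0][0]

print(voting_pattern(["Paris", "Paris", "Lyon"]))  # -> Paris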
The supervisor agent monitors and coordinates a group of working agents, stepping in when needed to guide, correct, or optimize their actions.
Structure and behavior: Worker agents operate autonomously → supervisor observes metrics or behaviors → triggers correction if needed.
Design rationale: Maintains high system integrity while allowing autonomous operation at lower levels.
Practical application:
This figure represents the supervisor-subordinate pattern common in AI multi-agent systems, where a central supervisor (orchestrator) agent receives the user's input, delegates tasks to multiple specialized subordinate agents, and gathers their outputs. The supervisor centrally controls communication, decision-making, and task assignment, ensuring that work progresses efficiently and reliably. Subordinates focus on executing specific tasks and report back to their supervisor, enabling streamlined coordination, monitoring, and recovery if any agent fails:
This resilience-focused pattern introduces a watchdog agent that observes system health and initiates recovery if failures or delays occur.
Structure and behavior: Passive monitoring → detect failure or timeout → rerun, escalate, or switch paths.
Design rationale: Improves robustness, uptime, and system recoverability. Crucial in production-grade systems.
Practical application:
This figure depicts a robust AI orchestration framework where an orchestrator leverages an LLM to delegate incoming tasks to multiple specialized agents. Each agent is paired with a watchdog module, an autonomous monitor ensuring task reliability and quality:
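A rough sketch of a watchdog wrapper that reruns a failing or slow task and escalates once the retry budget is exhausted; timeout handling here is simplified to a post-hoc check:
import time

def watchdog_run(task_fn, timeout=5.0, retries=2):
    # Observe each attempt; rerun on failure or timeout, escalate when exhausted.
    for _ in range(retries + 1):
        start = time.monotonic()
        try:
            result = task_fn()
            if time.monotonic() - start <= timeout:
                return result
        except Exception:
            pass  # swallow the error and retry
    raise RuntimeError("watchdog: task failed after retries, escalating")

print(watchdog_run(lambda: "ok"))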
This variation of the planner-executor pattern incorporates time constraints, scheduling logic, and deadline awareness.
Structure and behavior: Plan includes timestamps or durations → executors run tasks based on schedule → time-based decisions affect flow.
Design rationale: Essential for real-time or delayed execution scenarios. Supports long-horizon planning.
Practical application:
The following figure portrays the supervisor-subordinate pattern in AI agent systems, this time emphasizing the temporal dimension: the supervisor issues tasks to subordinate agents across several distinct phases or time steps. At each phase, subordinates execute specific actions and report their outputs; the supervisor assesses progress, updates the strategy, and delegates subsequent tasks based on cumulative results and changing context. This cyclical, time-aware interaction ensures that the system dynamically adapts and coordinates agent efforts throughout multi-stage processes.
Having explored the full spectrum of 19 multi-agent system design patterns, from simple sequential chains to complex hierarchical, validator, and consensus-based frameworks, we are now ready to transition from theory to practice. The richness of these patterns is not just academic; it forms the architectural backbone for building intelligent, modular, and scalable GenAI systems.
In this section, we will bring these patterns to life by constructing a real-world, production-ready HITL multi-agent RAG system. This implementation will utilize the StateGraph orchestration capabilities of LangGraph to dynamically route control across agents and tools based on task-specific logic and feedback.
This system will emphasize modularity, extensibility, and full local execution, with no dependency on external APIs or OpenAI services. Instead, we will use the following:
We will architect the system as a multi-agent workflow, using LangGraph's StateGraph to connect agents with different responsibilities: retrieval, grading, generation, and human oversight. Each component will be built in a modular fashion to enable debugging, customization, and reuse.
What follows is not just a demonstration of agentic reasoning, but a blueprint for how real-world GenAI applications can combine autonomy with accountability, reasoning with reliability, and speed with safety. Let us now walk through the architecture, folder structure, and step-by-step implementation of this intelligent system:
Figure 5.20: Folder structure of the HITL RAG system
One of the most critical design elements in production-grade AI systems is trust, ensuring that the outputs are accurate, grounded, and contextually appropriate. This is especially important in scenarios like education, healthcare, legal research, or enterprise documentation, where an incorrect or misleading response can have serious consequences. To address this, our system integrates a design pattern known as HITL.
In simple terms, HITL means that the AI does not always operate autonomously. Instead, at specific decision points, such as after generating an answer, the system pauses and asks for human validation. This ensures that a person has the opportunity to approve, reject, or request regeneration of the AI's response before it is finalized or acted upon.
In our implementation, the HITL logic is part of the LangGraph workflow. After the RAG agent produces an answer using hybrid retrieval and local LLM reasoning, the system prints the result along with its sources. It then explicitly calls a function that prompts the human user:
def human_approval_required():
    return input("\nApprove the answer? (yes/no): ").strip().lower() != "yes"
If the user types anything other than "yes", the system assumes the answer is unsatisfactory. A retry loop allows up to three regeneration attempts before it halts with the message "Answer rejected after multiple attempts."
For new AI practitioners, HITL is an essential mechanism to bridge the gap between AI autonomy and human judgment. It brings responsible AI into action, not just as a buzzword, but as a practical safeguard embedded within the system's architecture.
Let us unpack how HITL is implemented in this system:
def human_approval_required():
    return input("\nApprove the answer? (yes/no): ").strip().lower() != "yes"

retries = 3
for attempt in range(retries):
    ...
    if not human_approval_required():
        return result["answer"]
    print("\nRetrying with same question...")
return "Answer rejected after multiple attempts."
Let us understand why this matters:
This HITL feature introduces an extra layer of control and accountability, making the system more suitable for real-world deployment. It transforms a purely autonomous agentic system into a collaborative workflow, where humans and AI work together to produce reliable results.
Now that we understand the architecture and the role of HITL in ensuring trustworthy AI output, let us explore how this system is implemented in code. The following section walks through the full implementation of each module, step-by-step, highlighting how local embeddings, vector search, hybrid retrieval, ReAct prompting, and LangGraph-based orchestration come together to power an intelligent, controllable, and fully local RAG pipeline.
This implementation demonstrates a complete HITL RAG system, orchestrated with LangChain components and designed for full local execution. The system begins by parsing and chunking a local PDF document, then storing those chunks in a persistent vector database (Chroma) using locally generated embeddings.
A hybrid retriever combines BM25 keyword search and vector similarity to identify relevant chunks in response to user queries. The retrieved context is passed to a ReAct-style prompting chain that enables the local language model to reason step-by-step before generating a concise answer.
The system maintains conversational memory, enabling contextual continuity across multiple user inputs. After generating an answer, the system invokes a HITL function that pauses to request user approval. If the response is rejected, the system retries up to three times before gracefully terminating the flow.
This architecture is modular and scalable, making it suitable for enterprise-grade applications where answer accuracy, traceability, and human oversight are essential.
For the complete source code and file structure, refer to the GitHub repository.
In the previous section, we explored a HITL RAG architecture where the system paused for user validation before finalizing any answer. While this allowed for oversight, the structure was still largely linear and monolithic, with all logic centralized in a single chain.
To truly align with multi-agent design principles, we now decompose the RAG pipeline into modular, interacting agents, each responsible for a specific role. These agents include:
This design uses LangGraph's StateGraph to orchestrate the workflow, with clear transitions and conditional routing. Unlike the previous implementation, each agent is isolated in logic but coordinated through the graph, ensuring modularity, reusability, and transparency. Retry logic is also embedded: the generation step can re-execute up to three times if the human does not approve the response.
With these structural changes, we now achieve a true multi-agent HITL RAG system, which is both locally deployable and human-controllable.
Figure 5.21 illustrates a HITL RAG architecture enhanced with agentic components. The process begins with a user query, which is matched against a vector database populated with document embeddings. Documents are first chunked with metadata and embedded using an embedding model. A hybrid retrieval agent fetches relevant chunks based on the query, and a result generation agent synthesizes a response. The response then enters a human feedback loop, where a HITL agent either approves the output or triggers a retry mechanism up to three times. If the response remains unsatisfactory, it is rejected; otherwise, it is returned to the user.
Figure 5.21: End-to-end HITL RAG workflow with agentic feedback loop
For the complete source code and file structure, refer to Chapter_5_code.ipynb (multi-agent human-in-the-loop).
The retrieval agent is responsible for fetching relevant document chunks based on the user's question. It uses a hybrid retriever that combines BM25 and vector similarity.
Retrieval agent:
def retrieval_agent(state):
    return {"documents": retriever.get_relevant_documents(state["question"])}
The generation agent synthesizes a response using a ReAct-style prompting chain, reasoning step-by-step over the retrieved context. It also attaches source citations to ensure traceability and provide transparent grounding for the generated answers.
Generation agent:
def generation_agent(state):
    result = rag_chain.invoke({"question": state["question"]})
    return {
        "answer": result["answer"],
        "source_documents": result.get("source_documents", [])
    }
The human feedback loop introduces an approval step after each generated answer: the system prompts for user validation, allowing humans to approve, reject, or request regeneration, enabling controlled oversight and iterative refinement.
Human feedback loop:
def human_feedback_agent(state):
    approved = not human_approval_required()
    return {"approved": approved}
If the user rejects the answer, the system loops back to the generation agent for a retry, up to a maximum of three attempts. If the user approves, the answer is finalized and returned.
LangGraph's StateGraph manages the flow across agents, orchestrating the entire pipeline as a directed state graph. It sequences agent execution from retrieval to generation to validation, dynamically routing based on human feedback, enabling looped retries and graceful exits based on approval logic.
Orchestration using LangGraph:
workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieval_agent)
workflow.add_node("generate", generation_agent)
workflow.add_node("validate", human_feedback_agent)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", "validate")
workflow.add_conditional_edges(
    "validate",
    lambda state: "end" if state.get("approved") else "generate",
    {
        "end": END,
        "generate": "generate"
    }
)
The final graph is compiled and used in the main.py file to handle user input interactively.
This architecture exemplifies the agentic design principles discussed earlier in this chapter in the section Architecting agentic GenAI systems: each agent is isolated, testable, and extensible, enabling a flexible and robust foundation for intelligent retrieval systems. The integration of human validation ensures that the system not only answers, but answers responsibly.
To truly appreciate the architectural design patterns we have explored in this chapter, it is essential to distinguish between AI agents and the more advanced paradigm of agentic AI. Although these terms are sometimes used interchangeably, they represent fundamentally different levels of capability, autonomy, and system coordination in AI.
AI agents are autonomous software programs designed to perform specific tasks with minimal human intervention. These systems excel in narrow, well-defined domains such as answering customer service queries, scheduling meetings, or retrieving specific data from APIs. Their behavior is typically reactive, responding to input or triggers, and they often follow a linear, single-step execution pattern. While they can use tools like APIs or databases, their autonomy is generally confined to specific boundaries and does not extend to higher-order planning or collaborative reasoning.
In contrast, agentic AI refers to a more complex system composed of multiple AI agents working collaboratively to solve higher-order problems. These systems go beyond execution and instead focus on goal-setting, advanced planning, and orchestration across multiple steps. Agentic AI embodies characteristics such as multi-agent collaboration, persistent memory for contextual awareness, and adaptive decision-making based on evolving conditions. Unlike traditional AI agents that operate independently, agentic AI systems function as coordinated networks where agents can share information, delegate tasks, and adapt to new goals or contexts dynamically.
One of the key architectural shifts in agentic AI is the movement from isolated task execution to system-level orchestration. Here, a higher-level controller, or an orchestrator agent, coordinates the behavior of specialized agents, enabling the system to decompose complex goals into manageable subtasks. Each specialized agent contributes to a portion of the overall objective, and the orchestrator integrates their outputs to achieve coherent, goal-directed outcomes.
Additionally, while AI agents often rely on rule-based or supervised learning tailored to narrow tasks, agentic AI leverages more sophisticated learning strategies such as reinforcement learning, meta-learning, or hybrid approaches that allow for adaptation across broader task domains. This adaptability is crucial in applications like supply chain optimization, virtual project management, and enterprise automation, where static responses are insufficient, and dynamic goal-setting and reasoning are required.
Agentic AI also emphasizes persistent memory, a shared context that enables agents to remember previous interactions, track dependencies, and update strategies over time. This form of memory is not just a technical feature but a strategic enabler that allows agents to build upon one another's work, minimize redundant processing, and refine their decisions continuously.
In essence, while AI agents are tools, agentic AI is a system of thinkers: autonomous, interactive, and capable of complex planning. As you move forward with building real-world agentic systems, this distinction will guide your architectural choices, helping you select the right tools, coordination mechanisms, and reasoning strategies needed to scale beyond narrow AI tasks toward general-purpose, autonomous workflows.
In this chapter, we explored the foundational principles of architecting agentic GenAI systems, emphasizing how AI agents evolve from reactive executors to collaborative problem-solvers through structured multi-agent coordination. We examined key design patterns, such as sequential, loop, router, and hierarchical, that enable agents to reason, retrieve, act, and adapt in complex workflows. Building on this, we introduced HITL architectures within RAG, showcasing how humans can guide or validate agentic decisions. Finally, we distinguished between traditional AI agents and agentic AI, highlighting the latter's focus on multi-step planning, orchestration, and adaptive learning. These concepts lay the groundwork for building dynamic, autonomous systems capable of handling real-world complexity. In the next chapter, we transition from architectural patterns to execution strategies by implementing two-stage GenAI systems enhanced with grading mechanisms, a crucial technique for quality control, response ranking, and robust system evaluation in production-grade applications.
In the next chapter, we will explore interaction mechanisms in dense retrievals and their critical role in two-stage and multi-stage RAG systems. Topics include reranking strategies such as late and full interaction, multi-vector approaches, grading mechanisms, and a practical implementation of a multi-stage RAG workflow with routing and staged reasoning.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
As generative AI (GenAI) systems have become more prevalent in enterprise, research, and consumer applications, the demand for reliable and trustworthy outputs has never been higher. While large language models (LLMs) are capable of generating fluent and contextually appropriate answers, they often suffer from a critical flaw: hallucination. These fabricated or inaccurate outputs can undermine user trust and introduce significant risk in high-stakes domains like healthcare, law, finance, and customer support. This chapter introduces a practical and scalable solution: a two-stage generative pipeline that integrates answer grading and reranking as a validation layer before responses are surfaced to users. By systematically evaluating generated answers using a feedback loop, we shift from passive generation to active quality control, laying the foundation for more dependable GenAI systems.
You will implement this architecture using Python, LangChain, and LangGraph, constructing a modular pipeline consisting of a retriever, generator, and grader. The retriever gathers multiple relevant knowledge contexts, the generator proposes candidate answers, and the grader selects the most accurate or appropriate response using custom evaluation prompts or scoring mechanisms. By the end, you will not only understand the theory behind answer validation but also gain hands-on experience in engineering a GenAI feedback system that is both robust and production-ready.
In this chapter, we will learn about the following topics:
This chapter aims to provide a comprehensive understanding of advanced retrieval-augmented generation (RAG) systems, with a focus on the role of dense retrieval interactions and multi-stage processing. It begins by exploring the fundamental concepts of interaction in dense retrieval, followed by an in-depth discussion of two-stage and multi-stage RAG architectures. The chapter then introduces the grading mechanisms used to evaluate retrieval and generation quality. Finally, it presents a practical implementation of a multi-stage RAG workflow with intelligent routing, enabling adaptive query processing using vector search and web search. Readers will gain both conceptual clarity and hands-on insights into building robust RAG systems.
In Chapter 1, Introducing New Age Generative AI, we explored the concepts of bi-encoders and cross-encoders, along with the architecture of a two-stage GenAI system as illustrated in Figure 6.4. Before we get into the two-stage GenAI architecture, let us first examine the different levels of interaction, specifically no interaction, late interaction, and full interaction.
In Chapter 1, Introducing New Age Generative AI, we introduced the concept of dense retrieval. In dense retrieval systems, the way queries and documents interact during encoding and comparison plays a central role in determining both retrieval performance and computational efficiency. Broadly, interaction mechanisms fall into three categories: no interaction, late interaction, and full interaction, complemented by multi-vector representation as a closely related storage technique; each offers unique trade-offs between scalability and semantic matching precision.
The bi-encoder architecture represents the most scalable yet coarse-grained approach. Here, the query and the document are encoded independently into fixed-length vector embeddings using separate or shared neural encoders. Once encoded, these vectors are compared using lightweight similarity functions such as cosine similarity or dot product, enabling rapid retrieval using approximate nearest neighbor (ANN) search. This approach is widely used in large-scale systems due to its speed and efficiency. However, because the query and document tokens do not interact during encoding, the semantic alignment is relatively shallow, often missing finer contextual cues. This method is particularly useful for first-stage retrieval, where speed is prioritized over precision.
The following figure shows no interaction:
At the other end of the spectrum lies the cross-encoder, or full interaction model. In this approach, the query and document are concatenated and jointly encoded, allowing every query token to interact with every document token via mechanisms like cross-attention. This setup yields highly expressive representations and precise relevance scores, as the model performs deep semantic reasoning across token pairs. However, the trade-off is substantial: each document-query pair must be evaluated individually at inference time, making this method prohibitively expensive for retrieval from large corpora. Cross-encoders are often reserved for reranking the top-k candidates retrieved by lighter models like bi-encoders.
The following figure shows full interaction:
Late interaction models such as ColBERT, ColPali, and ColQwen offer a practical middle ground. Like bi-encoders, they encode queries and documents independently. However, instead of collapsing representations into a single vector, they retain token-level embeddings. During retrieval, a fine-grained comparison is performed between each query token embedding and all the document token embeddings using operations such as maximum similarity (MaxSim), the maximum cosine similarity per query token. The final relevance score is then computed by aggregating these token-level similarities, often using a sum or average of maximum scores across tokens.
This design enables token-aware matching without the compute burden of full attention. Additionally, because document token embeddings can be precomputed and stored (e.g., in a vector database), these models offer an efficient compromise between accuracy and scalability. Notably, recent variants like ColPali (for text-image fusion) and ColQwen (for integrating LLMs like Qwen) further extend late interaction to multimodal and generative contexts, where embeddings from vision-language models (VLMs) or instruction-tuned LLMs are aligned in a shared space for cross-modal retrieval and reranking.
The following figure shows late interaction:
The choice between no interaction, late interaction, and full interaction hinges on the application context. No interaction favors speed and indexing; full interaction favors accuracy but scales poorly; late interaction aims for the best of both by preserving rich token-level semantics with practical scalability, making it increasingly popular for dense and multimodal retrieval pipelines in modern AI systems.
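The difference between the two extremes can be made concrete with a small NumPy sketch: a single pooled cosine score for no interaction versus a token-level MaxSim score for late interaction. The random vectors are placeholders for real embeddings:
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def no_interaction_score(query_vec, doc_vec):
    # Bi-encoder: one pooled vector per side, a single similarity value.
    return cosine(query_vec, doc_vec)

def late_interaction_score(query_tokens, doc_tokens):
    # MaxSim: best document-token match per query token, summed over the query.
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

rng = np.random.default_rng(0)
q_toks, d_toks = rng.normal(size=(4, 8)), rng.normal(size=(12, 8))
print(no_interaction_score(q_toks.mean(0), d_toks.mean(0)))
print(late_interaction_score(q_toks, d_toks))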
Vector representations have become the backbone of modern information retrieval systems, enabling semantic search by embedding documents and queries into high-dimensional continuous spaces. Traditional dense retrieval methods typically represent an entire document as a single vector by pooling token-level embeddings. However, this approach often loses fine-grained semantic information, particularly in the case of long or information-dense texts.
To overcome this limitation, multi-vector representations have been introduced as a mechanism to store and query documents using multiple vectors per entity, often at the token or phrase level. This design enhances retrieval precision, particularly in scenarios where exact token-level matching is required. Modern vector databases such as Qdrant have introduced native support for multi-vector representations, providing a scalable infrastructure for such fine-grained retrieval mechanisms.
A multi-vector representation is shown in Figure 6.4, which refers to the practice of storing multiple vectors for a single logical unit of data, such as a document or paragraph. Instead of compressing all token-level embeddings into a single pooled representation (as in typical dense retrieval), the multi-vector approach retains multiple embeddings per document. These vectors are often derived from token-level or phrase-level outputs of transformer-based encoders.
This structure allows more nuanced and context-aware retrieval by enabling query-time comparison between query embeddings and the constituent vectors of each document. This is particularly advantageous for tasks such as reranking, where the goal is not just coarse retrieval, but fine-grained scoring based on partial semantic overlaps.
Qdrant offers first-class support for multi-vector representations, allowing each indexed entity to be associated with multiple named vector fields. Each vector field can be independently configured with its own dimensionality, similarity metric (e.g., cosine, dot product), and indexing strategy. A typical configuration involves two vector fields:
Qdrant enables token-aware reranking through a mechanism known as MaxSim, a similarity comparator that computes the MaxSim between each query vector and the set of document vectors. This strategy closely mirrors the reranking logic used in late interaction models and can be configured through the MultiVectorComparator.MAX_SIM setting.
Naively indexing all token-level vectors in an HNSW graph leads to severe performance bottlenecks:
To address this, Qdrant allows HNSW indexing to be disabled selectively for multi-vector fields. This optimization enables fast ingestion and lightweight reranking without sacrificing accuracy, as the initial retrieval step is handled via dense vector fields.
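A hedged sketch of such a configuration with the qdrant-client library follows; the collection name, field names, and dimensionalities are illustrative, and the API shown assumes a recent client version with multivector support:
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config={
        # Dense field for fast first-stage retrieval.
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
        # Token-level field compared with MaxSim; HNSW is disabled (m=0)
        # so ingestion stays fast and reranking stays lightweight.
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            hnsw_config=models.HnswConfigDiff(m=0),
        ),
    },
)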
While multi-vector representations in databases like Qdrant are inspired by late interaction models, the two concepts differ fundamentally in scope and role.
The table provides a comparative analysis between multi-vector representations as implemented in vector databases like Qdrant and late interaction model architectures such as ColBERT. While both approaches aim to leverage token-level embeddings for improved retrieval accuracy, their scope, implementation, and function differ significantly. Multi-vector representations focus on the infrastructure layer, optimizing how token embeddings are stored, indexed, and used during reranking. In contrast, late interaction models define the embedding generation and matching strategy at the model-level, typically during training and inference. The following table highlights key distinctions in their purpose, usage context, indexing strategies, and system dependencies. This comparison clarifies the complementary relationship between the two and underscores the role of vector databases in scaling late interaction-based retrieval pipelines:
| Dimension | Multi-vector representations (Qdrant) | Late interaction architectures (e.g., ColBERT) |
|---|---|---|
| Definition | Storage and querying mechanism supporting multiple vectors per document. | Model architecture that performs fine-grained similarity at query time. |
| Purpose | Infrastructure optimization for serving token-level representations. | Semantic modeling and retrieval at the embedding level. |
| Similarity function | Uses MaxSim or configurable similarity metrics for reranking. | Typically fixed to MaxSim computed between query and document tokens. |
| Indexing strategy | Allows selective disabling of indexing for multi-vectors. | Indexing is external; document embeddings are stored for matching. |
| Model dependency | Can use any model outputting multiple embeddings. | Requires specific architectural components (e.g., ColBERT transformer layers). |
| Usage context | Infrastructure-level support in production vector databases. | Algorithmic design used during training and inference. |
Table 6.1: Comparison of multi-vector and late interaction
Late interaction refers to the modeling technique that retains token-level embeddings and performs interaction at query time, whereas multi-vector support in Qdrant is a retrieval and storage mechanism that enables deployment of such models in an efficient and scalable manner. Late interaction models generate the embeddings; Qdrant's multi-vector infrastructure stores and utilizes them efficiently for reranking.
Multi-vector representations enable fine-grained document retrieval by preserving multiple embeddings per entity, a necessity for modern reranking architectures such as ColBERT. Qdrant's support for multi-vector fields and token-level MaxSim scoring provides a scalable infrastructure for deploying such systems in production environments. While conceptually related to late interaction, multi-vector support operates at the infrastructure-level, complementing model-level innovations by optimizing their deployment and performance characteristics.
In the context of RAG, the type of interaction between query and document representations plays a foundational role in determining both retrieval efficiency and answer accuracy. A two-stage RAG system typically involves an initial retrieval phase to select candidate documents, followed by a reranking phase that refines this list to improve the quality of the generated response. The nature and depth of interaction, whether no interaction, late interaction, or full interaction, directly impact both the architectural design and performance trade-offs of such systems.
The first-stage of a RAG system generally employs a bi-encoder architecture, also referred to as a no interaction model. In this approach, queries and documents are encoded independently into fixed-length vector representations using separate or shared neural encoders. These embeddings are stored and compared using similarity functions such as cosine similarity or dot product, often accelerated through ANN search. This allows for scalable, low-latency retrieval across large corpora, as only the query needs to be encoded at runtime. However, the lack of cross-token attention during encoding may limit semantic granularity, resulting in lower retrieval precision for complex queries.
To address the limitations of the bi-encoder, the second-stage of the RAG pipeline incorporates a reranking mechanism that re-evaluates and prioritizes the top retrieved candidates. This is where models with late interaction or full interaction are particularly relevant, as detailed below:
During reranking, an initial retrieval stage selects a shortlist of candidate documents using fast ANN search over dense vectors. Subsequently, each candidate is reranked by comparing its token-level vectors to the token-level query embeddings using similarity measures such as MaxSim. This process captures more precise alignment between specific query terms and relevant document parts, resulting in improved ranking quality.
重要的是,基于多向量的重排序无需对单个词元级向量进行索引,这显著降低了内存开销并加快了文档导入速度。通过将粗粒度检索阶段与细粒度评分阶段解耦,多向量重排序为在生产系统中部署后期交互模型提供了一种可扩展的机制。这种混合架构兼顾了速度和检索精度,使其特别适用于高性能语义搜索和 RAG 应用。
Importantly, multi-vector-based reranking does not require indexing the individual token-level vectors, which significantly reduces memory overhead and accelerates document ingestion. By decoupling the coarse retrieval phase from the fine-grained scoring phase, multi-vector reranking provides a scalable mechanism for deploying late interaction models in production systems. This hybrid architecture delivers both speed and retrieval accuracy, making it especially suitable for high-performance semantic search and RAG applications.
在实际的 RAG 架构中,双编码器用于从向量数据库中检索初始文档池(例如,前 100 个候选文档)。然后,这些候选文档会被传递给重排序器,重排序器会根据精度和延迟之间的权衡,采用延迟交互模型或完全交互模型。延迟交互模型提供了一种折衷方案,它支持可扩展的、与人工神经网络 (ANN) 兼容的存储,同时比纯双编码器方法提高了相关性。当精度至关重要且计算资源允许逐对处理时,完全交互模型是理想之选。
In practical RAG architectures, bi-encoders are used to retrieve an initial pool of documents (e.g., top 100 candidates) from a vector database. These candidates are then passed through a reranker that employs either late interaction or full interaction models, depending on the desired trade-off between precision and latency. Late interaction models offer a middle ground by supporting scalable, ANN-compatible storage while improving relevance over pure bi-encoder methods. Full interaction models are ideal when precision is paramount, and computational resources permit per-pair processing.
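A minimal sketch of this pattern using the sentence-transformers library follows; the model name is a common public checkpoint, and vector_store_search() is a hypothetical stand-in for the first-stage ANN lookup:

from sentence_transformers import CrossEncoder

query = "What is late interaction in retrieval?"
# Stage 1 (assumed done elsewhere): ANN search over bi-encoder embeddings
# returns a candidate pool, e.g., the top 100 documents.
candidates = vector_store_search(query, top_k=100)  # hypothetical helper

# Stage 2: a cross-encoder rescores only the shortlisted pairs.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
top_docs = reranked[:5]  # context forwarded to the generator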
Thus, understanding and selecting the appropriate interaction paradigm is essential for designing effective two-stage RAG systems. By aligning the retrieval and reranking stages with the appropriate encoder architectures, it is possible to achieve an optimal balance between performance, scalability, and accuracy.
The following figure displays the two-stage RAG architecture:
RAG systems extend the capabilities of LLMs by retrieving and conditioning on relevant external documents. Instead of relying solely on the model's internal parameters for knowledge retrieval, RAG explicitly fetches documents from an indexed corpus to ground the generative response in external, and often up-to-date, information. When reranking mechanisms are incorporated into this pipeline, the architecture evolves into what is commonly termed a two-stage RAG system.
The first stage of this architecture focuses on efficient retrieval from large-scale document corpora. Typically, this is accomplished using dense vector retrieval, where both queries and documents are encoded independently into vector representations. These vectors are indexed in a vector database using ANN search techniques. This allows for scalable and low-latency retrieval, yielding a broad set of semantically relevant documents, often in the range of top fifty to top hundred candidates. However, this stage lacks explicit interaction between the query and the document tokens during encoding. As a result, although the retrieved documents are topically related to the query, their contextual alignment may be shallow, affecting their usefulness in downstream generation tasks.
The second stage introduces a reranking component designed to refine the shortlist produced in the first stage. This is where richer interaction mechanisms are applied between the query and each candidate document. Unlike the initial retrieval, reranking models consider token-level relationships, enabling a deeper semantic alignment. These models may employ token-wise comparisons, attention mechanisms, or partial cross-attention structures that simulate full semantic interaction without incurring the high computational cost of reprocessing the entire corpus. This second pass produces a more accurate relevance score for each document and reorders the shortlist accordingly. The top-ranked documents are then forwarded as the contextual input to the language model for generation.
This bifurcated approach, fast retrieval followed by precise reranking, embodies a strategic trade-off between scalability and accuracy. The first stage ensures rapid exploration across a vast corpus, prioritizing recall and system responsiveness. The second stage ensures that only the most contextually appropriate documents are used to condition the LLM, thereby improving generation quality, factual consistency, and topical relevance. Without this reranking stage, the system risks grounding its output in loosely related or suboptimal documents, which can degrade response quality or introduce hallucinations.
So, reranking is not merely an optimization step but a structural enhancement that defines two distinct yet interdependent stages in the RAG pipeline: retrieval for breadth and reranking for depth. This two-stage configuration ensures alignment between retrieval utility and generation goals, and has become a foundational pattern in modern, high-performance RAG systems, especially those operating in open-domain, enterprise, or high-precision contexts.
The emergence of late interaction models like ColBERT and ColPali has blurred the lines between traditional two-stage RAG architectures and unified, interaction-rich retrieval systems. These models offer a compelling middle ground between the efficiency of bi-encoders and the precision of cross-encoders. The key question is: if ColBERT or ColPali is used as the retriever, is a separate reranking stage still necessary?
Unlike standard dense retrievers (e.g., dual-encoder architectures), ColBERT-type models do not compress documents and queries into a single vector. Instead, they preserve token-level embeddings, which are compared during retrieval using operations like MaxSim between query and document tokens. This preserves fine-grained semantic information while still enabling ANN search via indexing of token vectors (e.g., using late interaction indexing schemes).
As a result, ColBERT and ColPali already perform a sophisticated form of reranking at retrieval time. The scoring function considers multiple token interactions, offering far more semantic alignment than traditional bi-encoder retrieval, but without the full computational cost of cross-encoders. In effect, this late interaction serves as an implicit reranking mechanism, making the retrieval stage more precise.
However, despite their improved expressiveness, ColBERT-style models may still benefit from a second stage, that is, a two-stage RAG pipeline with a reranker, in high-stakes or highly nuanced applications. The reasons include:
In practical systems (e.g., enterprise RAG, long-context retrieval, hybrid multimodal setups), it is not uncommon to use ColBERT for high-recall shortlist generation and follow it with a cross-encoder reranker that focuses only on the top-k candidates.
RAG systems typically combine retrieval and generation to provide accurate and contextually relevant responses by supplementing generative language models with external knowledge. While commonly described as a two-stage process of retrieving relevant documents and subsequently generating answers from the retrieved content, advanced RAG implementations often involve multiple retrieval, filtering, reranking, and validation stages, collectively referred to as multi-stage RAG.
Standard RAG implementations include:
However, real-world applications often demand additional intermediary stages to enhance performance and accuracy. These stages address challenges such as ambiguity in queries, redundancy in retrieved results, and the potential for generative hallucination.
A multi-stage RAG architecture typically integrates additional steps such as:
Incorporating multiple stages into RAG architectures can offer several advantages:
From the preceding section, you should now understand that RAG systems enhance language models by incorporating external knowledge sources through a combination of retrieval and generation components. While the canonical RAG framework typically involves two stages, that is, retrieval followed by generation, emerging research and practical implementations demonstrate the efficacy of extending RAG into multi-stage architectures. These advanced configurations address the limitations of simple two-stage systems and enable the handling of complex tasks with higher accuracy, contextual understanding, and adaptability.
The following list outlines different types of multi-stage RAGs:
The evolution from simple to multi-stage RAG architectures reflects a broader trend toward more adaptive, intelligent, and context-aware AI systems. Each RAG variant addresses specific challenges in information retrieval and natural language generation (NLG), offering tailored solutions for diverse applications ranging from casual conversation to high-stakes analytical reasoning. As RAG continues to mature, hybrid and multi-agent configurations are likely to play an increasingly prominent role in knowledge-intensive AI workflows.
Grading mechanisms are integral in certain RAG variants to evaluate, rank, or select among multiple candidate responses, retrievals, or reasoning paths. This evaluative layer ensures that only the most contextually appropriate and factually accurate outputs are surfaced to the user. Different RAG grading mechanisms are as follows:
In essence, grading transforms RAG from a purely generative process into a more deliberative and evaluative pipeline, aligning with recent trends in reflective and multi-agent AI systems.
Despite their advantages, multi-stage RAG systems introduce complexity and computational overhead. Critical considerations include balancing performance gains against latency increases, managing resource allocation efficiently, and optimizing inter-stage communication to avoid bottlenecks.
Multi-stage RAG architectures represent a significant advancement over traditional two-stage models. By strategically incorporating additional retrieval, refinement, and validation steps, these sophisticated systems are better suited for high-stakes, real-world applications where accuracy, reliability, and contextual comprehension are paramount.
Token utilization is a critical consideration in the design and deployment of multi-stage RAG systems. Each stage of retrieval, reranking, validation, and generation consumes a portion of the available token budget, which is constrained by the context window of the underlying language model. Efficient token budgeting directly impacts both the fidelity of responses and the cost-effectiveness of system deployment.
In multi-stage pipelines, token usage typically escalates due to:
Token allocation must therefore be strategically managed. Some techniques include selective truncation of low-ranking documents, compression via summarization models, or tiered ranking systems that minimize token-intensive steps unless necessary. Advanced configurations use routing or grading mechanisms to determine which branches of the pipeline warrant deeper token investment.
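As a simple illustration of selective truncation, the following sketch greedily keeps top-ranked documents until a token budget is exhausted; tiktoken is used here purely as an example tokenizer:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer

def fit_to_budget(ranked_docs, budget):
    # Keep documents in rank order until the token budget would be exceeded.
    kept, used = [], 0
    for doc in ranked_docs:
        n = len(enc.encode(doc))
        if used + n > budget:
            break  # alternatively, truncate or summarize this document
        kept.append(doc)
        used += n
    return kept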
Ultimately, token optimization in multi-stage RAG systems is essential not only for computational efficiency but also for preserving model accuracy within the token constraints. Thoughtful management of token flow enables the design of scalable, high-precision RAG architectures suitable for enterprise deployment.
In RAG pipelines, a second stage remains essential even when initial retrieval appears sufficient. This is particularly true in scenarios where retrieved documents may be only partially relevant, contain noisy or ambiguous content, or require additional reasoning to determine their usefulness. The second stage introduces refinement, filtering, or validation mechanisms, typically powered by LLM-based graders, that help ensure only contextually aligned documents are passed to the generation module.
The following outlines a common grading component used in such second-stage setups:
The LLM is expected to return a JSON output with a single key:
{
"binary_score": "yes"
}
### Retrieval Grader
# Assumed imports; llm_json_mode (a chat model configured for JSON output) and
# retriever are defined earlier in the notebook.
import json
from langchain_core.messages import HumanMessage, SystemMessage

# Doc grader instructions
doc_grader_instructions = """You are a grader assessing the relevance of a retrieved document to a user question.
If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant."""
# Grader prompt
doc_grader_prompt = """Here is the retrieved document: \n\n {document} \n\n Here is the user question: \n\n {question}.
Carefully and objectively assess whether the document contains at least some information that is relevant to the question.
Return JSON with a single key, binary_score, that is 'yes' or 'no' to indicate whether the document contains at least some information that is relevant to the question."""
# Test
question = "What is Chain of thought prompting?"
docs = retriever.invoke(question)
doc_txt = docs[1].page_content
doc_grader_prompt_formatted = doc_grader_prompt.format(
document=doc_txt, question=question
)
result = llm_json_mode.invoke(
[SystemMessage(content=doc_grader_instructions)]
+ [HumanMessage(content=doc_grader_prompt_formatted)]
)
json.loads(result.content)
The LLM is instructed to verify whether the answer adheres strictly to the content in the provided documents, without introducing external information. The process emphasizes explanation-driven grading, requiring a reasoned judgment rather than a simple binary decision.
The expected output is a JSON object containing:
{
"binary_score": "no",
"explanation": "The answer introduces models not found in the document..."
}
### Hallucination Grader
# Hallucination grader instructions
hallucination_grader_instructions = """
You are a teacher grading a quiz.
You will be given FACTS and a STUDENT ANSWER.
Here is the grade criteria to follow:
Ensure the STUDENT ANSWER is grounded in the FACTS.
Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.
Score:
A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.
A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.
Avoid simply stating the correct answer at the outset."""
# Grader prompt
hallucination_grader_prompt = """FACTS: \n\n {documents} \n\n STUDENT ANSWER: {generation}.
Return JSON with two keys: binary_score, a 'yes' or 'no' score indicating whether the STUDENT ANSWER is grounded in the FACTS, and explanation, which contains an explanation of the score."""
# Test using documents and generation from above
hallucination_grader_prompt_formatted = hallucination_grader_prompt.format(
documents=docs_txt, generation=generation.content
)
result = llm_json_mode.invoke(
[SystemMessage(content=hallucination_grader_instructions)]
+ [HumanMessage(content=hallucination_grader_prompt_formatted)]
)
json.loads(result.content)
The output is a structured JSON object, for example:
{
"binary_score": "yes",
"explanation": "The answer clearly states the Llama 3.2 vision models and relates them to the question."
}
This grader helps differentiate between vague, uninformative answers and those that provide substantial, relevant insights.
### Answer Grader
# Answer grader instructions
answer_grader_instructions = """You are a teacher grading a quiz.
You will be given a QUESTION and a STUDENT ANSWER.
Here is the grade criteria to follow:
(1) The STUDENT ANSWER helps to answer the QUESTION
Score:
A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.
The student can receive a score of yes if the answer contains extra information that is not explicitly asked for in the question.
A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct.
Avoid simply stating the correct answer at the outset."""
# Grader prompt
answer_grader_prompt = """QUESTION: \n\n {question} \n\n STUDENT ANSWER: {generation}.
Return JSON with two keys: binary_score, a 'yes' or 'no' score indicating whether the STUDENT ANSWER meets the criteria, and explanation, which contains an explanation of the score."""
# Test
question = "What are the vision models released today as part of Llama 3.2?"
answer = "The Llama 3.2 models released today include two vision models: Llama 3.2 11B Vision Instruct and Llama 3.2 90B Vision Instruct, which are available on Azure AI Model Catalog via managed compute. These models are part of Meta's first foray into multimodal AI and rival closed models like Anthropic's Claude 3 Haiku and OpenAI's GPT-4o mini in visual reasoning. They replace the older text-only Llama 3.1 models."
# Test using question and generation from above
answer_grader_prompt_formatted = answer_grader_prompt.format(
question=question, generation=answer
)
result = llm_json_mode.invoke(
[SystemMessage(content=answer_grader_instructions)]
+ [HumanMessage(content=answer_grader_prompt_formatted)]
)
json.loads(result.content)
You can design multiple types of graders depending on the stage of your RAG pipeline and the evaluation goals (e.g., correctness, factuality, relevance, fluency). The following table describes the main grader types, their purpose, when to use them, and example use cases:
| Grader type | Purpose | When to use | Example use case |
|---|---|---|---|
| Retrieval relevance grader | Check if the retrieved document is relevant to the query. | After retrieval, before generation. | Ensure only documents relevant to "What is transfer learning?" are passed to the LLM. |
| Hallucination grader | Check if the generated answer is grounded in retrieved facts. | After generation, before the final output. | Verify that the LLM-generated answer about Llama 3.2 matches the retrieved documents. |
| Answer quality grader | Judge whether the generated answer meaningfully addresses the query. | Final evaluation stage before display. | Decide if the answer explains "How does gradient descent work?" correctly and helpfully. |
| Faithfulness grader | Similar to the hallucination grader, but focused on logical alignment. | Use when the LLM may infer unstated conclusions. | Check if the reasoning behind an answer about causal inference is backed by sources. |
| Completeness grader | Judge if the answer covers all required sub-parts of the question. | For multi-part or compound questions. | Evaluate the answer to "What are the benefits and risks of GPT models?" |
| Coherence grader | Check whether the answer is logically and grammatically well-formed. | When you want to ensure readability and clarity. | Ensure that a long-form answer flows well and has no contradictions or gaps. |
| Toxicity or bias grader | Detect harmful, biased, or inappropriate content. | For safety, before deployment or display. | Filter out biased statements in answers about race, gender, and politics. |
| Conciseness grader | Ensure the answer is not verbose or off-topic. | When you want short answers (e.g., for summaries or mobile use). | Trim the answer about quantum computing to fit within 50 words. |
| Consistency grader | Check if answers to similar questions are consistent. | For evaluating multi-turn or batch outputs. | Ensure that answers to "What is AI?" and "Define AI" are aligned. |
| Instruction-following grader | Evaluate adherence to specific instructions or constraints. | When prompts contain custom instructions (e.g., "list three points only"). | Check if the LLM follows instructions like using bullet points or avoiding math symbols. |
| Evidence attribution grader | Check if the source of an answer is cited properly. | For knowledge-intensive QA or academic applications. | Ensure the answer about a research paper includes a citation like (Smith et al., 2022). |
Table 6.2: Types of graders in RAG pipelines
The accompanying Chapter_6.ipynb contains multiple code implementations of multi-stage, retrieval-augmented question answering systems using LangChain, LangGraph, and Ollama-integrated local LLMs (e.g., Llama 3.2). The system retrieves information either from a pre-embedded local vector store or via a live web search, routes queries intelligently, performs generation, and then grades the output for quality and relevance.
It begins by embedding domain-specific documents (e.g., blog posts on agents and adversarial attacks) using NomicEmbeddings and storing them in an SKLearnVectorStore. Questions are routed through a JSON-mode router model that determines whether to use the vector store or web search via the Tavily API.
Once documents are retrieved, a retrieval grader checks relevance to the query. If the documents are relevant, the system invokes a prompt-based RAG generator. Generated answers are validated with a hallucination grader (to ensure grounding) and an answer quality grader (to assess completeness).
The whole system is orchestrated via a LangGraph state machine, allowing conditional flow through routing, grading, generation, and citation. This design ensures adaptive response synthesis, using both static knowledge and live web data, with integrated quality control mechanisms for reliability and trustworthiness.
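A skeletal version of such a graph is sketched below. The node bodies are stubs, and the routing map is simplified relative to the notebook's full implementation:

from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str
    documents: List[str]
    generation: str

def retrieve(state):
    # vector-store lookup (stubbed)
    return {"documents": ["<retrieved chunk>"]}

def web_search(state):
    # Tavily fallback (stubbed)
    return {"documents": ["<web result>"]}

def generate(state):
    # RAG answer generation (stubbed)
    return {"generation": "<answer>"}

def route(state) -> str:
    # A JSON-mode router model decides the data source (stubbed here)
    return "retrieve"

graph = StateGraph(GraphState)
graph.add_node("retrieve", retrieve)
graph.add_node("web_search", web_search)
graph.add_node("generate", generate)
graph.set_conditional_entry_point(route, {"retrieve": "retrieve", "web_search": "web_search"})
graph.add_edge("retrieve", "generate")
graph.add_edge("web_search", "generate")
graph.add_edge("generate", END)
app = graph.compile()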
The following figure illustrates a multi-stage RAG workflow that incorporates routing logic to determine the most appropriate path for query resolution. This architecture combines traditional retrieval from a vector store, document grading for relevance, and optional fallback strategies such as web search. If the retrieved content is found to be insufficient or not useful, the system invokes a generation phase with retries and conditional exits based on utility. Such a routing-enabled design ensures robust handling of diverse input types and retrieval failures, supporting enhanced accuracy and adaptability in real-world deployments.
In this chapter, we delved into the evolving landscape of RAG systems, emphasizing the shift from basic two-stage models to more sophisticated multi-stage architectures. We explored how dense retrieval interactions underpin relevance matching and how grading mechanisms enhance the trustworthiness of responses by assessing factuality, relevance, and completeness. Through the implementation of a multi-stage RAG workflow with intelligent routing, we demonstrated how retrieval sources and generation pathways can be dynamically selected based on question type and content quality. This modular and adaptive design paves the way for scalable, reliable, and context-aware GenAI systems in real-world applications.
In the next chapter, we will implement a multimodal retrieval system, focusing exclusively on the retrieval component.
In an increasingly visual and interconnected digital world, the ability to search and retrieve information across different modalities, such as text and images, has become a cornerstone of advanced artificial intelligence (AI) applications. This chapter introduces the concept of multimodal retrieval, where systems are designed to understand and correlate both textual and visual inputs. Unlike traditional search engines that rely solely on textual similarity, multimodal systems use vector representations from both images and text to deliver richer, more contextually aligned results. You will learn how to build such a system by integrating Qdrant as a vector database, Contrastive Language-Image Pre-training (CLIP) models from Hugging Face for generating image embeddings, and LangChain to orchestrate the retrieval process. These tools enable unified access to multiple data formats, allowing users to perform flexible cross-modal searches, such as retrieving descriptions from images or identifying images that match textual inputs.
Throughout the chapter, you will construct dual-index vector stores and develop hybrid retrievers capable of handling diverse query formats. Python-based implementations will guide you through indexing workflows, embedding pipelines, and retrieval logic that switches seamlessly between modalities. Beyond technical architecture, the chapter delves into practical design decisions like similarity scoring, modality prioritization, and custom retrieval logic. By the end, you will have the skills to deploy a production-ready multimodal retriever, a foundation applicable to use cases in e-commerce recommendations, visual content discovery, and semantic search engines. This hands-on approach ensures you not only understand the theory but also gain the ability to implement scalable, real-world solutions.
In this chapter, we will learn about the following topics:
The objective of this chapter is to design and implement a multimodal retrieval system capable of handling both text and image inputs. Readers will learn how to preprocess and embed data from multiple modalities using bi-encoders, normalize vector representations, and store them efficiently in a vector database such as Qdrant. The system supports cross-modal queries such as retrieving images from text prompts and textual content from image inputs, enabling semantic search across heterogeneous data types. This chapter lays the technical foundation for building intelligent, modality-aware applications and prepares readers to extend the system further by incorporating generative models in the subsequent chapter.
Building upon the foundational concepts introduced in Chapter 2, Deep Dive into Multimodal Systems, this section offers an overview of the four key output-based classifications of multimodal systems: text-to-image, image-to-text, text and image-to-image, and text to specifications and image. These categories define how different combinations of input modalities are used to produce specific output formats, forming the backbone of modern multimodal AI applications. By organizing systems based on their output types, we create a clearer framework for understanding how diverse technologies, from image generation models to captioning tools and specification-driven design engines, function in real-world scenarios. This classification not only reinforces the distinctions made earlier but also provides a structured lens through which the retrieval and generation challenges in upcoming chapters can be better understood and implemented. A quick recap is as follows:
Viewed through a theoretical lens, multimodal systems span translation, alignment, and fusion paradigms. Text-to-image and image-to-text systems primarily focus on translating between modalities. The text and image-to-image class demonstrates the fusion of combined multimodal embeddings before image generation. Lastly, the text to specifications and image category blends translation (text to specs), structure generation (specs to image), and fusion, handling both symbolic and visual outputs.
Recognizing these categories is crucial for designing multimodal retrieval systems, such as the hybrid retrievers discussed in Chapter 6, Two and Multi-stage GenAI Systems, where indexing, querying, and retrieval must accommodate diverse input/output modalities. Such classification informs how we build vector stores, craft embedding strategies, and define cross-modal search capabilities for tasks like finding images that match a specification or retrieving specs from an image.
Figure 7.1 illustrates a multimodal retrieval system architecture, where retrieval operates seamlessly across text and image modalities. Academically, this approach leverages embeddings, a vector-based representation capturing semantic relationships within data, generated separately for textual and visual content.
The process initiates with user queries, which may consist of text, images, or both. These inputs are passed to specialized embedding models: a text embedding model for textual queries or documents, and an image embedding model for visual inputs. Documents undergo chunking into smaller units to improve the granularity and efficiency of retrieval, whereas images are directly embedded into the vector representation.
Once embeddings are computed, they are stored in a multimodal vector database designed to handle mixed data types. Upon receiving a query, the system performs vector similarity searches across this database, retrieving results based on semantic proximity rather than exact matches. The returned results, combining textual chunks and images, are then provided back to the user.
In applied contexts, such multimodal retrieval systems are foundational for advanced applications like cross-modal search, content-based image retrieval, and integrated semantic recommendation systems. The combined use of text and image embeddings enhances the accuracy and contextual relevance of retrieval, supporting richer, more intuitive user interactions.
The advent of multimodal retrieval systems marks a pivotal advancement in the field of information retrieval, enabling systems to process and semantically align content across distinct data modalities such as text and images. The architectural schematic under discussion illustrates a robust framework that integrates text and image embedding pipelines with a shared or coordinated vector space for high-precision, cross-modal search. This section provides a comprehensive exposition of the technical components, data flow mechanisms, and system design principles underpinning such architectures.
The following figure presents the architecture of a multimodal retrieval system that integrates both textual and visual data for unified query processing. User queries, which may include text and/or images, are encoded using separate embedding models tailored to each modality. These embeddings are stored in a multimodal vector database that supports joint retrieval across document chunks and images. Upon querying, the system performs vector similarity search and returns semantically aligned results from both data types, thereby enabling robust and context-rich response generation.
The system’s architecture relies on multimodal processing, where user queries can originate from text, images, or a combination of both. To handle this effectively, the pipeline employs specialized embedding models that unify these inputs into a shared semantic space. The key components are outlined in the following list:
The resulting embeddings provide modality-agnostic encodings that facilitate efficient similarity search across heterogeneous data types.
The vector store must support:
This architecture may optionally employ dual-index structures, where text and image embeddings are stored and queried separately, with fusion logic applied during post-processing.
This stage yields a ranked list of retrieved vectors, each corresponding to a text chunk or image segment stored in the database.
This final presentation layer bridges the gap between the dense, vector-based internal representation and the user's cognitive expectations, delivering explainable and contextually relevant results.
These features collectively contribute to the system's scalability, responsiveness, and effectiveness in real-time applications.
Such architectures are foundational in a range of AI-driven applications, including but not limited to:
The multimodal retrieval system detailed above exemplifies the convergence of NLP, computer vision, and vector similarity search in a unified architecture. By enabling seamless cross-modal interaction, it provides a powerful framework for real-time, semantically rich retrieval across complex information landscapes. For data science and generative AI (GenAI) practitioners, mastering the design and implementation of such systems is essential to advancing the state of multimodal AI applications. Let us understand it using a code example; the code is shared as part of this book.
The following Python libraries constitute the foundational software stack required to implement a multimodal retrieval system that integrates text and image embeddings, vector indexing, and real-time interaction. Each package has been carefully selected to support key functionalities such as vector representation learning, semantic search, document parsing, and interactive interface design.
The following code provides a practical and extensible implementation of a bidirectional multimodal search system, integrating embedding-based semantic understanding with a scalable vector store backend. Its modular design allows for the straightforward extension of additional modalities (e.g., audio, tabular data) and advanced features such as cross-modal reranking, hybrid retrieval, or user feedback loops. It serves as a canonical example of operationalizing multimodal embeddings in real-time applications using lightweight web frameworks and composable AI components.
This Streamlit-based application presents a lightweight user interface for performing bidirectional multimodal retrieval, enabling users to search from text-to-image and from image-to-text. The implementation integrates vector embedding, similarity search, and payload-based retrieval, and provides a clear example of how multimodal embeddings can be operationalized through a modern, interactive interface.
The following section breaks down and outlines the key functional components of the multimodal retrieval application, covering initialization, interface design, and query processing for both text-to-image and image-to-text pathways. Each step is crucial in enabling seamless interaction between user inputs, embedding generation, and vector-based semantic search.
import streamlit as st
from PIL import Image
import sys
import os
The following snippet determines the absolute path of the root directory and appends it to the system path to allow cross-module imports:
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if ROOT_DIR not in sys.path:
    sys.path.append(ROOT_DIR)
This ensures that submodules within the project hierarchy (e.g., rag.index_builder) can be accessed without import errors, promoting modular software design.
from rag.index_builder import build_vectorstores, TEXT_COLLECTION, IMAGE_COLLECTION
@st.cache_resource(show_spinner="Loading vector index...")
def init_system():
    return build_vectorstores()

client, mm_embed = init_system()
Here, build_vectorstores() is invoked to initialize the multimodal vector index. The function is decorated with @st.cache_resource, which caches the result to avoid reinitializing embeddings or loading vector data on every page refresh. This is particularly important for high-latency operations, such as loading large embedding models or querying vector databases.
st.title(" 🔍多模态搜索演示(文本↔图像)")
option = st.radio("选择您的查询类型:", ["文本 → 图片", "图片 → 文本"])
界面首先显示标题,然后是用于选择两种交互模式的单选按钮:
st.title(" 🔍 Multimodal Search Demo (Text ↔ Image)")
option = st.radio("Choose your query type:", ["Text → Image", "Image → Text"])
The interface begins with a title, followed by a radio button selection for two modes of interaction:
This conditional branching drives the remainder of the application flow.
if option == "Text → Image":
query = st.text_input("Enter a text prompt to retrieve relevant image:")
if query:
st.write(f"Searching for image similar to: *{query}*")
q_vec = mm_embed.get_text_embedding(query)
Once the user submits a text prompt, it is embedded into a high-dimensional vector using the get_text_embedding() method. This vector representation captures the semantic intent of the input query.
        res = client.query_points(
            collection_name=IMAGE_COLLECTION,
            query=q_vec,
            using="image",
            with_payload=["image"],
            limit=1,
        )
The embedded query vector is submitted to the vector database (IMAGE_COLLECTION) using a semantic similarity search. Only the top-1 match is retrieved. The parameter with_payload=["image"] indicates that the associated image filename should be returned alongside the vector match.
        if res and res.points:
            image_file = res.points[0].payload["image"]
            st.image(f"data/images/{image_file}", caption="Top Match", use_column_width=True)
        else:
            st.warning("No image match found.")
If a result is returned, the payload is used to locate and render the matching image. If no semantically close match exists in the vector store, the user is notified accordingly.
elif option == "Image → Text":
uploaded_img = st.file_uploader("上传图片以查找相关文本", type=["png", "jpg", "jpeg"])
反向模式下,用户上传一张图片。上传的图片会临时保存到磁盘以供后续处理:
如果 uploaded_img:
with open("temp_input_image.jpg", "wb") as f:
f.write(uploaded_img.read())
st.image("temp_input_image.jpg", caption="上传的图片", use_column_width=True)
保存后,图像会被传递给get_image_embedding()方法:
img_vec = mm_embed.get_image_embedding("temp_input_image.jpg")
然后使用生成的向量查询 TEXT_COLLECTION:
res = client.query_points(
collection_name=TEXT_COLLECTION,
查询=img_vec,
使用="text",
with_payload=["source"],
limit=1,
)
向量搜索会检索语义最相关的文本片段。有效载荷["source"]包含检索到的文本内容:
如果 res 和 res.points:
source_text = res.points[0].payload["source"]
st.success("匹配项最多:")
st.write(source_text)
别的:
st.warning("未找到相关文本。")
结果随后会显示在界面上。如果没有结果达到相似度阈值,则会发出警告信息。
elif option == "Image → Text":
uploaded_img = st.file_uploader("Upload an image to find related text", type=["png", "jpg", "jpeg"])
For the reverse mode, the user uploads an image. The uploaded image is temporarily saved to disk for further processing:
    if uploaded_img:
        with open("temp_input_image.jpg", "wb") as f:
            f.write(uploaded_img.read())
        st.image("temp_input_image.jpg", caption="Uploaded Image", use_column_width=True)
Once saved, the image is passed to the get_image_embedding() method:
        img_vec = mm_embed.get_image_embedding("temp_input_image.jpg")
The resulting vector is then used to query the TEXT_COLLECTION:
        res = client.query_points(
            collection_name=TEXT_COLLECTION,
            query=img_vec,
            using="text",
            with_payload=["source"],
            limit=1,
        )
The vector search retrieves the most semantically relevant text snippet. The payload["source"] contains the retrieved textual content:
        if res and res.points:
            source_text = res.points[0].payload["source"]
            st.success("Top matching text:")
            st.write(source_text)
        else:
            st.warning("No relevant text found.")
Results are then rendered on the interface. In case no result meets the similarity threshold, a warning message is issued.
The directory structure shown reflects a clean and modular organization for a multimodal retrieval system. The root folder data contains two subdirectories: documents and images. The documents folder typically stores textual sources (e.g., PDFs, text, or Markdown files) that are later chunked and embedded using text encoders. The images folder contains visual data (e.g., PNG, JPG) to be processed using an image embedding model. This separation supports independent preprocessing and indexing of each modality, facilitating streamlined multimodal embedding, storage, and retrieval workflows in systems built for tasks like search, captioning, or cross-modal question answering.
Figure 7.2: The data folder contains visual and textual data
Folder structure of the retrieval system: the following directory showcases the retrieval system, which represents the core module of a RAG system. It contains Python source files that handle different stages of data processing:
The __pycache__ directory stores compiled bytecode for performance optimization during execution. This structure reflects good modular design and clear separation of concerns. Let us discuss these in more detail in the following section.
The code defines two utility functions, load_pdfs_and_texts() and load_images(), to ingest and preprocess textual and visual data, respectively. These functions serve as critical components in a multimodal retrieval system, facilitating the creation of semantically aligned embeddings across document and image modalities. The implementation leverages the LangChain framework for structured document handling and text chunking and adopts a principled approach to prepare raw data for embedding and indexing in downstream vector databases. Let us understand the code in detail:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
import os
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
for fname in os.listdir(folder_path):
Each file in the folder is processed conditionally:
Document(page_content=chunk, metadata={"source": fname})
This metadata enables downstream traceability and source attribution, which are essential in retrieval-based applications where provenance is required.
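Pulling these pieces together, a condensed sketch of load_pdfs_and_texts() might look as follows; the exact file-type handling may differ from the book's full listing:

def load_pdfs_and_texts(folder_path: str):
    docs = []
    for fname in os.listdir(folder_path):
        path = os.path.join(folder_path, fname)
        if fname.lower().endswith(".pdf"):
            # PyPDFLoader returns one Document per page; join them before chunking
            pages = PyPDFLoader(path).load()
            text = "\n".join(p.page_content for p in pages)
        elif fname.lower().endswith((".txt", ".md")):
            with open(path, encoding="utf-8") as f:
                text = f.read()
        else:
            continue
        for chunk in splitter.split_text(text):
            docs.append(Document(page_content=chunk, metadata={"source": fname}))
    return docs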
Document(page_content=os.path.join(folder_path, f), metadata={"image": f})
This design aligns with LangChain's Document schema, ensuring that both text and image inputs are compatible with a unified document-processing pipeline despite originating from different modalities.
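Under the same assumptions, a minimal sketch of load_images() is:

def load_images(folder_path: str):
    # Store the image path as page_content and the filename as metadata
    exts = (".png", ".jpg", ".jpeg")
    return [
        Document(page_content=os.path.join(folder_path, f), metadata={"image": f})
        for f in os.listdir(folder_path)
        if f.lower().endswith(exts)
    ]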
In these settings, such preprocessing routines are foundational to building systems for RAG, semantic search, and cross-modal alignment, where consistent document representations across modalities are critical.
This code snippet defines a function to initialize a Hugging Face embedding model for multimodal tasks using the llama_index library. It imports HuggingFaceEmbedding from llama_index.embeddings.huggingface and sets the model to openai/clip-vit-base-patch32, a popular CLIP model that jointly embeds text and images in the same semantic space. The function get_mm_embedder() accepts a device argument (e.g., cpu or cuda) and returns an embedding interface that can generate vector representations for both modalities. The trust_remote_code=True flag allows execution of custom code from Hugging Face repositories, enabling more flexible model loading.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
_MODEL_ID = "openai/clip-vit-base-patch32"
_MODEL_ID = "openai/clip-vit-base-patch32"
def get_mm_embedder(device: str = "cpu"):
    return HuggingFaceEmbedding(model_name=_MODEL_ID, device=device, trust_remote_code=True)
The CLIP model (openai/clip-vit-base-patch32) used in this code functions as a bi-encoder because it independently encodes text and images into vectors using separate neural network branches. Each modality, text and image, is processed in parallel through its respective encoder (a transformer for text and a ViT for images), without direct interaction during inference. The resulting embeddings are projected into a shared latent space, where similarity (e.g., cosine distance) is computed between the two. This architecture is computationally efficient and enables fast retrieval tasks like image-to-text or text-to-image matching by precomputing embeddings for each modality separately.
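To make the shared-space idea concrete, the following snippet embeds a caption and an image independently and compares them with cosine similarity; it assumes the get_mm_embedder() helper defined above, and the image path is hypothetical:

import numpy as np

embedder = get_mm_embedder()
text_vec = np.array(embedder.get_text_embedding("a dog playing in a park"))
img_vec = np.array(embedder.get_image_embedding("data/images/dog.jpg"))  # hypothetical file

# Cosine similarity in CLIP's shared latent space
cos = text_vec @ img_vec / (np.linalg.norm(text_vec) * np.linalg.norm(img_vec))
print(f"text-image similarity: {cos:.3f}")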
The index_builder.py function operationalizes a central tenet of multimodal retrieval systems: encoding disparate modalities into semantically aligned vector spaces and storing them for fast, similarity-based search. By separating collection logic by modality and using cosine-normalized embeddings, the implementation adheres to best practices in vector search architecture. The design is modular, easily extensible to additional modalities (e.g., audio or tabular data), and production-ready for RAG, cross-modal search, and intelligent information access. Let us explore index_builder.py further.
The function build_vectorstores() constructs and populates two separate vector collections—one for text and one for images—within a local instance of the Qdrant vector database. This enables fast, similarity-based retrieval in multimodal retrieval systems, where user queries may be textual or visual in nature.
该实现方案利用模块化组件进行文档摄取、嵌入生成、规范化和存储,确保架构具有可扩展性、可解释性和易于扩展性。下面我们详细了解一下这些组件:
The implementation leverages modular components for document ingestion, embedding generation, normalization, and storage, ensuring that the architecture remains scalable, interpretable, and easily extendable. Let us understand them in detail:
from pathlib import Path
from typing import List
from rag.embedding_utils import get_mm_embedder
from qdrant_client import QdrantClient, models
from rag.loaders import load_pdfs_and_texts, load_images
from langchain.schema import Document
from numpy.linalg import norm
Constants are defined to point to storage paths and collection names:
DB_PATH = "data/qdrant_mm"
TEXT_COLLECTION = "vdr_text"
IMAGE_COLLECTION = "vdr_images"
def normalize(vecs):
    return [v / norm(v) for v in vecs]
This function performs L2 normalization on all vectors to ensure unit-length embeddings. This is essential when using cosine similarity, which depends solely on vector direction rather than magnitude. Normalization guarantees consistent distance calculations in the high-dimensional embedding space.
text_docs: List[Document] = load_pdfs_and_texts("data/documents")
image_docs: List[Document] = load_images("data/images")
Text and image files are ingested from separate directories using pre-defined loaders that wrap each entry into LangChain-style Document objects, preserving content and metadata.
embedder = get_mm_embedder()
The multimodal embedder (typically CLIP-based) is initialized. It provides both text and image embedding into a single shared vector space:
text_vecs = normalize(embedder.get_text_embedding_batch([...]))
image_vecs = normalize(embedder.get_image_embedding_batch([...]))
Each modality’s embeddings are normalized and prepared for insertion into Qdrant.
client = QdrantClient(path=DB_PATH)
A local Qdrant instance is initialized with persistent storage located at data/qdrant_mm.
if not client.collection_exists(...):
    client.create_collection(...)
Two separate vector collections, vdr_text for text and vdr_images for images, are created if they do not already exist:
Both use cosine similarity as the distance metric, and the dimensionality (size=dim) is inferred from the first embedding vector. This separation of collections ensures optimized retrieval per modality while allowing for hybrid or late-fusion retrieval strategies later.
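The elided calls above might expand roughly as follows. This is a sketch based on the public qdrant_client API, with dim inferred from the first text embedding as just described:

from qdrant_client import models

dim = len(text_vecs[0])  # dimensionality inferred from the first embedding

for name in (TEXT_COLLECTION, IMAGE_COLLECTION):
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=models.VectorParams(size=dim, distance=models.Distance.COSINE),
        )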
client.upload_points(
    TEXT_COLLECTION,
    [models.PointStruct(id=i, vector=text_vecs[i], payload={"source": d.page_content}) for i, d in enumerate(text_docs)],
)
client.upload_points(
    IMAGE_COLLECTION,
    [models.PointStruct(id=i, vector=image_vecs[i], payload={"image": Path(d.page_content).name}) for i, d in enumerate(image_docs)],
)
Image embeddings are similarly uploaded, with the image filename stored as metadata. This metadata is critical for subsequent retrieval operations where query results must be rendered back to the user or used for further reasoning.
return client, embedder
The function returns both the Qdrant client and the multimodal embedder, so downstream retrieval code can reuse them without re-initialization.
Ensure the following before executing: all dependencies are installed, the data/documents and data/images directories exist and contain valid files, and the DB_PATH location (data/qdrant_mm) is writable.
This script is typically run once, either during setup or re-indexing of your corpus.
The following code snippet is for executing the entire embedding and indexing pipeline. It begins by importing the build_vectorstores function from the rag.index_builder module, which is responsible for loading documents and images, generating their embeddings, and storing the resulting vectors in a Qdrant vector database. This function encapsulates all the key components required for preparing a multimodal vector store.
from rag.index_builder import build_vectorstores

build_vectorstores()
print("Embedding complete. Vector stores loaded into Qdrant.")
When build_vectorstores() is called, the system performs several tasks: it reads textual data from data/documents and images from data/images, uses a shared multimodal embedder (such as CLIP) to generate normalized vector embeddings for both modalities, and initializes the Qdrant database if it has not already been created. It also creates separate collections for text and image vectors (if they do not exist) and uploads the data along with associated metadata for future retrieval.
Finally, the print() statement confirms the successful execution of the indexing process. This script is typically executed once during system setup or any time the document/image corpus is updated. Before running the script, ensure that all dependencies are installed and that the required directory structure is in place, along with valid content files.
While the current implementation establishes a robust multimodal retrieval pipeline, it is important to recognize that it does not yet support generative outputs. The system allows for efficient retrieval of semantically relevant text or images based on a user’s query, but stops short of performing natural language generation (NLG). This design reflects a classical retrieval-only architecture. To evolve this into a full-fledged RAG system, readers are encouraged to extend the pipeline with generative capabilities.
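As a concrete picture of what this retrieval-only stage provides, the following sketch runs a text-to-image query against the indexes built earlier. It assumes the constants and functions of rag.index_builder are importable; the query string and printed fields are illustrative:

import numpy as np
from rag.index_builder import build_vectorstores, IMAGE_COLLECTION

client, embedder = build_vectorstores()

# Embed and L2-normalize the text query, then search the image collection.
q = np.array(embedder.get_text_embedding_batch(["diagram of a solar panel installation"])[0])
query_vec = (q / np.linalg.norm(q)).tolist()

hits = client.search(collection_name=IMAGE_COLLECTION, query_vector=query_vec, limit=5)
for hit in hits:
    print(hit.score, hit.payload["image"])  # filename stored as metadata during indexing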
To begin, readers should integrate an LLM, such as GPT, Llama, or Mistral, that can synthesize coherent responses using both the query and the retrieved content. This requires constructing a wrapper that couples the retriever with a generation module. Libraries such as LangChain or LlamaIndex offer high-level abstractions like RetrievalQA or RAG chains, which streamline this process. These frameworks allow retrieved documents to be passed directly as context into the LLM, enabling output generation in the form of answers, summaries, or semantic interpretations.
For multimodal scenarios where image embeddings are part of the retrieval results, an additional step may be required. Since most LLMs operate on text, readers should either preprocess retrieved images into captions using image-to-text models or employ multimodal LLMs (e.g., GPT-4V, LLaVA, or Kosmos-2) that natively support image inputs. This enhancement will allow the system to generate contextualized descriptions or insights that span both visual and textual domains.
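As one hedged illustration of the captioning route, the Hugging Face image-to-text pipeline with a BLIP checkpoint can turn retrieved images into text an LLM can consume; the model choice and file paths here are illustrative, not prescriptive:

from transformers import pipeline

# Any image-to-text checkpoint could be substituted here.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_image(image_path: str) -> str:
    # Returns a short natural-language caption for one image.
    return captioner(image_path)[0]["generated_text"]

# Example: fold captions of retrieved images into the generation context.
# context = "\n".join(caption_image(p) for p in retrieved_image_paths)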
In summary, readers seeking to extend this project should focus on integrating an LLM for response synthesis, wrapping the retriever and generator in a unified chain (for example, via LangChain or LlamaIndex), and handling visual results through captioning models or natively multimodal LLMs.
This generative extension not only elevates the system from semantic matching to intelligent reasoning but also aligns it with the current state-of-the-art in multimodal question answering and document understanding.
This chapter guided the reader through the design and implementation of a multimodal retrieval system that integrates text and image inputs using vector embeddings. It demonstrated how to preprocess documents and images, embed them with bi-encoders like CLIP, and store them in a Qdrant vector database for efficient semantic search. The system supports cross-modal querying (text-to-image, image-to-text) and establishes a solid foundation for real-world applications. While the current setup enables retrieval, a future extension involves integrating LLMs for RAG, allowing the system to generate coherent, context-aware outputs across modalities.
In the next chapter, we will implement the missing generative component by building a complete multimodal retrieval and generation system.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
In this chapter, we address the one remaining piece, the generative component deferred from Chapter 7, Building a Bidirectional Multimodal Retrieval System, by extending our multimodal retrieval pipeline into a full retrieval-augmented generation (RAG) system. Up to now, we have focused on indexing documents and images, embedding them into a shared vector space, and retrieving relevant text or visuals based on user queries. Here, we will integrate a large language model (LLM) to synthesize coherent, context-aware responses using those retrieved items. We will demonstrate how to wrap the retriever in a generation chain, craft prompt templates that blend the user's query with retrieved context, and handle both text-to-image and image-to-text workflows. By the end of this chapter, you will have a complete, end-to-end multimodal system capable not only of finding relevant content but also of generating insightful answers and summaries.
In this chapter, we will learn about the following topics:
This chapter provides a comprehensive overview of advanced evaluation and recommendation strategies in LLM systems. It begins by examining generation techniques, highlighting how LLMs produce context-aware outputs that drive downstream tasks. Building on this, it introduces multimodal recommendation methods that integrate text, image, and other data modalities to improve personalization and user engagement. To ensure the quality and relevance of these generated and recommended outputs, the chapter explores grading mechanisms, automated assessment techniques powered by LLMs that evaluate retrieval accuracy, coherence, and factuality. These grading strategies form the basis for the emerging paradigm of LLM-as-judge, where the LLM is tasked not only with generating responses but also with ranking and validating them. This interconnected view underscores how generation, recommendation, and grading work in concert to support scalable, trustworthy AI systems.
Building upon the foundational concepts introduced in Chapter 7, Building a Bidirectional Multimodal Retrieval System, this section offers an implementation of the generative component by extending our multimodal retrieval pipeline into a full RAG system.
In the preceding chapter, we implemented Figure 8.1, a multimodal retrieval system architecture, where retrieval operates seamlessly across text and image modalities. Academically, this approach leverages embeddings, a vector-based representation capturing semantic relationships within data, generated separately for textual and visual content.
The process initiates with user queries, which may consist of text, images, or both. These inputs are passed to specialized embedding models: a text embedding model for textual queries or documents, and an image embedding model for visual inputs. The documents undergo chunking into smaller units to improve the granularity and efficiency of retrieval, whereas images are directly embedded into the vector representation.
Once embeddings are computed, they are stored in a multimodal vector database designed to handle mixed data types. Upon receiving a query, the system performs vector similarity searches across this database, retrieving results based on semantic proximity rather than exact matches. The returned results, combining textual chunks and images, are then provided back to the user.
In this chapter, we will implement the generation part of the generative AI (GenAI) system as shown in Figure 8.2, specifically the circled portion of the figure. The multimodal RAG pipeline in Figure 8.2 enables seamless integration of text and image data into a unified semantic search and generation system. The architecture emphasizes modularity and extensibility, making it suitable for a wide range of applications in knowledge retrieval, visual question answering (VQA), and AI-powered document understanding. By utilizing a vector database capable of storing and searching across multiple modalities, the system facilitates richer interaction and more accurate responses, grounded in both textual and visual evidence.
The system presented in Figure 8.2 outlines a multimodal information retrieval and generation framework that integrates both textual and visual data to support enhanced user interaction. This architecture leverages modality-specific embedding models and a unified vector database to retrieve relevant information, which is subsequently synthesized into a coherent response by an LLM. The design is optimized for applications requiring cross-modal reasoning, such as VQA, document search with image augmentation, or interactive multimodal assistants.
The following section presents a comprehensive end-to-end pipeline for building a multimodal RAG system that seamlessly integrates textual and visual data. Users can submit either text or image queries, which are routed through specialized embedding models to produce unified vector representations. These embeddings are stored in a shared vector database, enabling cross-modal similarity search across both document chunks and image content. Upon retrieval, the top-k relevant results ground the response generation process, powered by an LLM. The final output is a context-rich natural language response reflecting both the query intent and embedded knowledge. The following list outlines all the required modules:
- rag/embedding_utils.py: initializes the shared CLIP-based multimodal embedder.
- rag/loaders.py: ingests PDFs, text files, and images as Document objects.
- rag/index_builder.py: builds and populates the Qdrant text and image collections.
- generator.py: wraps the LLM behind init_generator and generate_response.
- grader.py: provides LLM-based grading of retrieval relevance and response quality.
The code for this chapter contains every Python module you need to ingest data, build your indexes, run retrieval, and generate responses. For a detailed understanding of the code blocks, please refer to Chapter 7, Building a Bidirectional Multimodal Retrieval System, section: Code implementation and explanation. As a quick checklist, confirm that the ingestion, indexing, retrieval, and generation modules listed above are all in place.
With the retrieval pipeline in place, this section now shifts focus to the generator, which plays a pivotal role in transforming retrieved context into natural language responses. While the rest of the code remains consistent with the earlier setup, the emphasis here is on the generator.py module:
This component offers a clean and modular interface for text generation within the RAG workflow, clearly separating model initialization (init_generator) from the actual generation logic (generate_response). This design promotes reusability, simplifies integration, and aligns well with best practices in prompt engineering and LLM abstraction.
By abstracting the generation logic from retrieval mechanics, the module remains flexible for use across multiple modalities, provided the context can be represented textually. This decoupling is particularly important in multi-agent or multimodal systems where the same generation module can be reused across varied sources of input.
You are an assistant. Based on the following query and context, provide a relevant and coherent answer.
Query: {query}
Context:
{context}
Answer:
This template is designed to guide the model's behavior by clearly defining its role (assistant) and the input fields it should consider before generating an answer. The use of distinct sections for Query and Context ensures structured input formatting, improving grounding and coherence in the model’s responses.
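A minimal sketch of generator.py consistent with this template, assuming the same LangChain components used by the graders later in this chapter (the temperature value is an assumption, chosen to favor fluent prose), could look like this:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def init_generator():
    # Assumed temperature: higher than the graders, since generation favors fluency.
    llm = ChatOpenAI(temperature=0.7, model="gpt-3.5-turbo")
    prompt = PromptTemplate(
        input_variables=["query", "context"],
        template="""You are an assistant. Based on the following query and context, provide a relevant and coherent answer.
Query: {query}
Context:
{context}
Answer:"""
    )
    return LLMChain(llm=llm, prompt=prompt)

def generate_response(generator_chain, query: str, context: str) -> str:
    return generator_chain.run({"query": query, "context": context})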
The end-to-end code can be found in Chapter 8, Building a Multimodal RAG System, section: multimodal_rag_system.py.
Building on the foundations of multimodal RAG, where LLMs leverage diverse modalities such as text, images, and structured data to enhance information access and synthesis, we now transition to a related yet distinct application: multimodal recommendation systems. While RAG focuses on retrieving and generating contextually rich responses, multimodal recommendation systems use similar cross-modal understanding to predict and suggest relevant content tailored to user preferences. This chapter explores how the same capabilities that empower RAG, embedding alignment, multimodal fusion, and semantic understanding, are adapted to deliver highly personalized, diverse, and context-aware recommendations across industries and platforms.
On an OTT platform, a multimodal LLM (MLLM) can revolutionize content recommendation by integrating textual descriptions, promotional images, video thumbnails, user reviews, and viewing history. For instance, if a new user watches trailers with dark cinematography and reads thriller plotlines, the model can infer nuanced genre preferences, such as noir thrillers with psychological elements, despite limited watch history. This enables effective recommendations even in cold-start scenarios, where traditional systems relying on metadata or user similarity may falter. By aligning multimodal signals, LLMs enhance both discovery and engagement, tailoring suggestions to the user’s implicit and explicit tastes.
An MLLM can function as a powerful recommendation engine by leveraging its ability to understand and integrate diverse data types: text, images, audio, and even video. Traditional recommendation systems often rely solely on collaborative filtering or structured metadata, which can struggle in cold-start scenarios or fail to capture nuanced user preferences. In contrast, MLLMs extract rich, high-dimensional embeddings from varied content sources and user interactions, enabling more personalized and context-aware recommendations.
For instance, models like CLIP or GPT-4V can understand both product descriptions and visual aesthetics, making them ideal for recommending fashion, home decor, or multimedia content. LLMs can summarize user histories, infer intent from queries, and match them with relevant items across modalities. They also enable explainability, like generating natural language justifications for recommendations, which enhances trust and user satisfaction.
Advanced systems like MLLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation (Molar), LLM-Based Multimodal Recommendation with User History Encoding and Compression (HistLLM), and serendipitous MLLM have already demonstrated real-world impact, outperforming conventional approaches in personalization, novelty, and engagement metrics. With hierarchical planning and compressed user histories, these models support scalable and diverse recommendations in real-time. As LLMs continue to evolve, they are poised to become foundational in building next-generation, multimodal recommendation engines across industries.
Emerging architectures such as LLMs with Graph Augmentation for Recommendation (LLMRec) expand this paradigm further by embedding LLM-driven reasoning directly into interaction graphs. These systems do not just interpret content, but rather, they actively augment recommendation graphs with inferred relationships, enriched item metadata, and user intent profiles generated by LLMs. By combining LLM capabilities with the structural power of graph-based models, LLMs for ranking-based recommendation (LlamaRec) enhance both semantic depth and recommendation accuracy, particularly in sparse data scenarios.
This section explores leading architectures and recent innovations in multimodal recommendation systems powered by LLMs. It covers models like Multimodal Recommender System (MMRec), Molar, HistLLM, LLMRec, and others that integrate text, image, and behavioral signals to deliver personalized, context-aware, and explainable recommendations. Key design strategies such as multimodal embedding fusion, graph augmentation, history compression, and serendipitous discovery are discussed alongside supporting tools like Ducho 2.0 and ATFLRec (A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model). The section also outlines a practical implementation roadmap and highlights the advantages of these systems in handling cold-starts, enhancing diversity, and improving engagement. Details are as follows:
The following table provides a comparative overview of prominent multimodal recommendation systems, summarizing their core strategies, modalities handled, innovations, cold-start capabilities, system compatibility, and the tools or frameworks used. This comparison highlights the diverse approaches through which LLMs are being integrated with multimodal signals to deliver scalable, personalized, and intelligent recommendation experiences.
| Model name | Core strategy | Modality fusion | Key innovation/strength | Cold-start handling | Compatibility | Tools/frameworks used |
|---|---|---|---|---|---|---|
| MMRec | Multimodal embedding + deep ranking model | Text + image | Combines modalities in a shared latent space; strong false-positive control | Moderate | Deep ranking pipelines | PyTorch, Transformers, ResNet-50 |
| Molar | Collaborative filtering alignment with multimodal input | Text + image + behavior | Aligns item embeddings with user behavior in sequences | High | Sequential recommender systems | PyTorch, Hugging Face, self-attention-based sequential model (SASRec) |
| HistLLM | History compression using LLM prompt token | All user interactions | Encodes full user history into a single token for fast inference | High | LLM-based inference | OpenAI API, LangChain, Faiss |
| Serendipitous LLM | Intent modeling with hierarchical planning | Text + contextual features | Promotes novelty while preserving relevance | High | Personalized exploration | Llama, prompt injection, PlannerX |
| LLMRec | LLM-driven user graph augmentation + noise filtering | Text + graph + attributes | Enhances robustness in sparse environments; model-agnostic | Very high | GNNs, MF hybrid systems | Neo4j, GraphSAGE, OpenAI, DGL |

Table 8.1: Model comparison overview
MLLM-based recommendation engines represent the next evolution in personalized content delivery. By leveraging the combined strengths of deep multimodal perception and natural language reasoning, these systems offer superior relevance, contextual understanding, and user satisfaction. They are especially useful in handling cold-start scenarios, generating diverse suggestions, and enhancing user engagement through explainable and intuitive recommendations.
Having explored the capabilities and design principles of multimodal recommendation systems, it becomes evident that delivering high-quality suggestions is only one part of the equation. Equally important is the ability to assess, rank, and validate these recommendations in a structured and reliable manner. This brings us to the next critical aspect of intelligent systems: grading. In the following section, we shift our focus from generation to evaluation, examining how grading mechanisms, both rule-based and model-driven, can be applied to score responses, rank recommendations, and ensure system outputs meet user expectations and domain-specific standards.
As discussed in Chapter 6, Two and Multi-stage GenAI Systems, grading plays a critical role in validating and optimizing the output quality of multimodal RAG systems. Without a robust grading mechanism, several issues can compromise system reliability and user trust. First, the absence of quality control may lead to the generation of irrelevant, incoherent, or hallucinated responses, especially when combining diverse modalities like text, images, and video. This degrades user experience and undermines the credibility of recommendations or answers. Second, systems without grading cannot self-assess or improve over time, leading to stagnant or even deteriorating performance as the knowledge base evolves. In safety-critical domains such as healthcare, education, or finance, ungraded outputs can cause misinformation or biased recommendations with serious consequences. Third, the lack of a feedback loop hinders fine-tuning and model alignment efforts, preventing adaptive personalization or performance optimization. Furthermore, the inability to rank candidate outputs weakens multi-candidate selection strategies that could otherwise promote diversity and novelty. Finally, in multi-agent or hybrid RAG setups, where outputs from different retrieval or reasoning modules need to be evaluated for consensus, grading becomes essential for orchestrated decision-making. In summary, grading is not just a post-processing step. It is foundational to ensuring accuracy, trustworthiness, and adaptability in multimodal RAG systems. As shown in Figure 8.3, the grading process is situated downstream of the core retrieval and embedding operations and serves as an intelligent evaluation mechanism for both retrieval relevance and generative response quality:
The pipeline begins with a user query, which is simultaneously processed through text and image embedding models. These models generate vector representations of the input, which are then used to query a vector database containing multimodal embeddings derived from both documents and images. Before storage, the documents are segmented into chunks and embedded alongside any associated images to support fine-grained semantic retrieval.
Once the vector database returns a ranked list of relevant results, the grading component is invoked. This component is powered by an LLM operating in a dual role: it assesses the relevance of each retrieved document to the user's question, and it scores the quality and coherence of the generated response.
Together, these grading modules act as a feedback loop, which not only determines which results are presented to the user but also enables fine-tuning of retrieval and generation mechanisms. By leveraging the LLM as a grader, the system ensures that output quality is continually assessed through advanced language understanding capabilities rather than relying on static heuristic rules.
This framework elevates the utility of multimodal RAG systems by integrating intelligent, automated grading, ensuring that users receive the most relevant, high-quality results in both retrieval-based and generative interactions.
The following section provides a breakdown of the components with explanations and embedded code.
The script uses LangChain’s components for LLMs, prompt templates, and chain orchestration:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
These libraries allow integration with OpenAI's GPT models and enable dynamic construction of LLM-based workflows.
This component evaluates how well a generated response answers the user's query based on the retrieved context. It uses a language model to assign a score from 1 to 5, along with a justification, enabling precise assessment of response quality, coherence, and alignment with the original user intent. The details are as follows:
def init_grader():
    llm = ChatOpenAI(temperature=0.3, model="gpt-3.5-turbo")
    prompt = PromptTemplate(
        input_variables=["query", "context", "response"],
        template="""Evaluate the quality of the following generated response.
Query: {query}
Context: {context}
Response: {response}
Give a score from 1 to 5 and explain why.
Score and Justification:"""
    )
    return LLMChain(llm=llm, prompt=prompt)
def grade_response(grader_chain, query: str, context: str, response: str):
    return grader_chain.run({
        "query": query,
        "context": context,
        "response": response
    })
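For example, with illustrative query, context, and response strings:

grader = init_grader()
feedback = grade_response(
    grader,
    query="What does the warranty cover?",
    context="The warranty covers manufacturing defects for 24 months.",
    response="It covers manufacturing defects for two years.",
)
print(feedback)  # e.g., a 1-5 score followed by a short justification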
The retrieval relevance grader follows the same pattern, with an initialization function and an execution function:
def init_retrieval_grader():
    llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    prompt = PromptTemplate(
        input_variables=["question", "document"],
        template="""You are a grader assessing relevance of a retrieved document to a user question.
If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant.
Here is the retrieved document:
{document}
Here is the user question:
{question}
Carefully and objectively assess whether the document contains at least some information that is relevant to the question.
Return a JSON object with a single key, binary_score, that is either 'yes' or 'no'."""
    )
    return LLMChain(llm=llm, prompt=prompt)
def grade_document_relevance(grader_chain, question: str, document: str):
    return grader_chain.run({
        "question": question,
        "document": document
    })
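Because the prompt requests a JSON object, the raw string output should be parsed before use. A hedged sketch, with a lenient fallback for models that wrap the JSON in extra text (the question and document strings are illustrative):

import json

relevance_grader = init_retrieval_grader()
raw = grade_document_relevance(
    relevance_grader,
    question="How do I reset the device?",
    document="Hold the power button for ten seconds to restore factory settings.",
)
try:
    verdict = json.loads(raw)["binary_score"]  # expected: 'yes' or 'no'
except (json.JSONDecodeError, KeyError):
    verdict = "yes" if "yes" in raw.lower() else "no"  # lenient fallback
print(verdict)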
These graders are crucial for developing intelligent and self-evaluating RAG systems where feedback loops help improve reliability, explainability, and user satisfaction.
Grading and generation are fundamentally distinct tasks in natural language systems and therefore require specialized models or chains to achieve optimal performance. Generation involves creating fluent, contextually relevant, and user-aligned responses based on input prompts or retrieved context. It prioritizes creativity, coherence, and intent satisfaction. In contrast, grading is an evaluative task that demands objectivity, consistency, and critical reasoning to assess the quality, correctness, or relevance of a response or retrieved content. Using the same model for both tasks can introduce conflicts: a generative model may exhibit confirmation bias by favoring its own outputs when acting as a grader, thus undermining fairness in evaluation. Additionally, prompts optimized for generation typically encourage verbosity and hypothesis formation, whereas grading prompts require precision, brevity, and analytical rigor. From a systems design perspective, separating these two roles allows task-specific prompt engineering, temperature settings, and scoring criteria. This modularity enhances explainability, enables benchmarking of generation performance, and allows independent updates or model selection per task. Consequently, as shown in Figure 8.4, using distinct LLMs or chains for grading and generation aligns with best practices in responsible AI system design and ensures more robust, transparent, and accountable recommendations in RAG workflows.
Grading using cloud-based LLMs offers significant advantages over local deployments, especially in the context of reliability, scalability, and performance. Cloud LLMs, such as OpenAI’s GPT-3.5 or GPT-4, benefit from continuous fine-tuning, access to extensive training data, and infrastructure optimizations that are difficult to replicate on-premise. These models are regularly updated to align with the latest linguistic trends, reasoning improvements, and safety filters, resulting in more consistent and accurate evaluations of query-response quality or document relevance. Furthermore, cloud LLMs are typically deployed on high-performance hardware that allows for rapid inference at scale, which is essential for real-time or large-batch grading tasks in production environments. In contrast, local LLMs are often constrained by limited GPU resources and outdated weights, which can degrade grading fidelity. Additionally, implementing version control, bias mitigation, and prompt safety measures on local models requires significant engineering effort. For academic and enterprise systems where robustness and accuracy of evaluation are critical, leveraging cloud-based LLMs as graders ensures higher trustworthiness, up-to-date linguistic knowledge, and greater standardization, making them a superior choice despite cost considerations.
You can find the code in the GitHub repository of Chapter 8, Building a Multimodal RAG System, under Chapter_8_multimodal_rag_system_Grader.py, including grading with a local LLM.
By examining Figure 8.4, you may begin to recognize an emerging concept known as LLM-as-a-judge; if so, you have already grasped its central idea.
LLM-as-a-judge refers to the use of an LLM to evaluate, grade, or rank the outputs of other AI systems, especially in tasks like generation, retrieval, summarization, or reasoning. Instead of using hard-coded rules or human raters, the LLM is prompted to act as an intelligent evaluator.
Figure 8.5 illustrates a best practice architectural pattern where grading and generation are handled by separate LLMs, each optimized for a distinct purpose. A local LLM is used for content generation tasks, ensuring fast, cost-efficient, and offline operation. In parallel, a cloud-hosted LLM acts as an impartial judge, responsible for evaluating both the retrieval relevance and response quality. This separation of roles enables more objective assessment, improves feedback loop integrity, and avoids bias from self-evaluation. The use of cloud LLMs for judgment ensures consistent, high-quality grading aligned with broader semantic understanding, especially for complex or nuanced evaluations required in downstream tasks.
LLM-as-a-judge operates, as shown in Figure 8.5, by prompting a capable LLM (e.g., GPT-4 or GPT-3.5 Turbo) with an explicit rubric, such as relevance, accuracy, clarity, or consistency, and asking it to evaluate or compare outputs based on these criteria. Three common approaches include pointwise grading, where a single output is scored against the rubric; pairwise comparison, where the judge selects the better of two candidate outputs; and reference-guided grading, where the judge scores an output against a gold answer.
The following describes how it is applied to our system.
In our system, the two graders defined above play exactly this role: the response quality grader assigns each generated answer a 1-5 score with a justification, and the retrieval relevance grader issues a binary yes/no judgment for each retrieved document.
Both are classic examples of LLMs acting as evaluators, making subjective or semantic judgments using natural language prompts and structured outputs.
The current implementation of the retrieval relevance grader in grader.py is limited to evaluating textual content. Specifically, the prompt expects a document, assumed to be a text chunk, and a question, then determines whether the document is relevant to the query based on semantic or keyword overlap. This approach is effective for evaluating text retrieved from a corpus, but does not apply to visual content such as images.
To extend the grading system to support image relevance evaluation, the reader should consider implementing one of the following enhancements: converting retrieved images into textual captions with an image-to-text model so that the existing text grader can assess them, or swapping the grader LLM for a multimodal model (e.g., GPT-4V or LLaVA) that can judge image relevance directly.
Implementing either of these enhancements would make the relevance grading system more robust and inclusive of multimodal content, aligning it with the broader goals of end-to-end RAG in real-world multimodal AI systems.
In this chapter, readers explored the integration of core components essential for building intelligent, human-aligned AI systems. Starting with generation, the chapter demonstrated how LLMs can produce contextually relevant responses. This was extended into the realm of multimodal recommendation, where text and visual inputs jointly informed retrieval and personalization. Readers also learned how to incorporate grading mechanisms using OpenAI models, enabling automatic, scalable evaluation of both retrieved content and generated outputs. The chapter culminated with the concept of LLM-as-a-judge, emphasizing the role of LLMs in semantically rich, human-aligned evaluation processes.
Having established a strong foundation, the next chapter will extend this architecture by introducing a reranking layer, a critical enhancement that further refines retrieval quality before generation. Readers will understand how rerankers selectively prioritize top candidates based on semantic relevance, factual grounding, or user preferences. This addition plays a vital role in multimodal RAG pipelines, ensuring that the content fed into the LLM for generation is not only relevant but optimally ranked. Through this, we move closer to designing robust, explainable, and high-utility AI systems capable of dynamic reasoning across modalities.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
In an increasingly visual and interconnected digital world, the ability to search and retrieve information across different modalities, such as text and images, has become a cornerstone of advanced artificial intelligence (AI) applications. This chapter introduces the concept of multimodal retrieval, where systems are designed to understand and correlate both textual and visual inputs. Unlike traditional search engines that rely solely on textual similarity, multimodal systems use vector representations from both images and text to deliver richer, more contextually aligned results. You will learn how to build such a system by integrating Qdrant as a vector database, Contrastive Language-Image Pretraining (CLIP) models from Hugging Face for generating image embeddings, and LangChain to orchestrate the retrieval process. These tools enable unified access to multiple data formats, allowing users to perform flexible cross-modal searches, such as retrieving descriptions from images or identifying images that match textual inputs.
Throughout the chapter, you will construct dual-index vector stores and develop hybrid retrievers capable of handling diverse query formats. Python-based implementations will guide you through indexing workflows, embedding pipelines, and retrieval logic that switches seamlessly between modalities. Beyond technical architecture, the chapter delves into practical design decisions like similarity scoring, modality prioritization, and custom retrieval logic. By the end, you will have the skills to deploy a production-ready multimodal retriever—a foundation applicable to use cases in e-commerce recommendations, visual content discovery, and semantic search engines. This hands-on approach ensures you not only understand the theory but also gain the ability to implement scalable, real-world solutions.
In this chapter, we will learn about the following topics:
This chapter explores reranking in information retrieval and multimodal retrieval-augmented generation (RAG) systems. It introduces key reranker categories, with a special focus on cross-encoders for refining retrieved results. Readers will understand the architecture of cross-encoders, multi-index embedding in multimodal contexts where both images and text are involved, and how these models enhance semantic precision. A practical code walk-through demonstrates how to implement and integrate a cross-encoder-based reranker in a multimodal RAG pipeline. The chapter concludes with a hands-on exercise, challenging readers to complete the missing components and solidify their understanding through active implementation.
Building upon the foundational concepts introduced in Chapter 1, Introducing New Age Generative AI; Chapter 6, Two and Multi-stage GenAI Systems; and Chapter 8, Building a Multimodal RAG System, let us understand Figure 9.1, which illustrates a two-stage RAG architecture that incorporates a cross-encoder reranker for enhanced result precision. The workflow begins with a user query that passes through input processing before proceeding to the retrieval pipeline. Simultaneously, documents in the corpus are chunked and passed through an embedding model, such as a transformer-based encoder, to generate dense vector representations. These are stored in a vector database.
At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.
These reranked documents, along with the original user query, are passed into the large language model (LLM) for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.
Rerankers are pivotal components in both traditional information retrieval systems and modern RAG pipelines. In general information retrieval, rerankers refine an initial list of candidate documents retrieved by a fast, often approximate method. This second-stage reranking is crucial for ensuring that the most semantically or contextually relevant results are surfaced first. With the rise of neural search and large-scale vector databases, rerankers have become even more important as they bridge the gap between high-recall retrieval and high-precision semantic understanding.
In the context of RAG systems, rerankers take on an even more critical role. A typical RAG pipeline involves retrieving passages or documents relevant to a user query and then feeding those to a language model to generate grounded responses. If the retrieved content is only loosely relevant or noisy, the final generation may contain hallucinations or inaccuracies. Rerankers help solve this problem by reordering the retrieved candidates based on a deeper semantic evaluation, often using powerful language models. This ensures that only the most relevant and contextually appropriate passages are forwarded to the generative stage, improving the accuracy and reliability of the system.
The categories of rerankers are as follows:
Commercial offerings such as Cohere's Rerank application programming interface (API) exemplify this approach. These services allow developers to submit a query and a list of retrieved documents, returning a rescored and reordered list based on deep semantic matching. Cross-encoder rerankers are ideal when precision is more important than speed or cost, such as in legal search, academic research, or QA systems with relatively small candidate pools.
Late interaction models offer significantly better scalability than cross-encoders and are well-suited for large collections. Variants like ColBERTv2 use advanced techniques such as vector quantization and dimensionality reduction to reduce storage costs while maintaining high retrieval accuracy. Although late interaction models are not as precise as cross-encoders, they often outperform traditional bi-encoders and single vector retrieval approaches in both effectiveness and efficiency.
A two-stage hybrid approach is especially common in enterprise search and RAG systems. An initial candidate pool is retrieved using a fast lexical or vector-based method, and then a more powerful reranker, often a cross-encoder, is applied to reorder the top-N results. This setup combines the recall strength of the first-stage with the precision of the second, enabling both scalability and semantic depth. In some systems, reranking is even used in a third stage after applying business logic or user-specific constraints.
In RAG pipelines, reranking significantly improves the quality of document retrieval before generation. For example, a vector search might retrieve fifty documents based on cosine similarity, but the top-ranked ones might not always be the most relevant. A reranker, whether a cross-encoder or late interaction model, can reorder these candidates, ensuring that only the most relevant ones are passed into the LLM's context window. This not only improves generation accuracy but also reduces hallucinations by grounding the output in semantically aligned information.
因此,重排序器在 RAG 中充当语义过滤器,将文档池压缩并提炼成一个聚焦的、高精度的上下文,用于生成结果。许多现代 RAG 实现,包括 LangChain 和 LlamaIndex,现在都将重排序作为内置或可选模块。Qdrant 、Weaviate和Pinecone等向量数据库也支持超量检索和重排序工作流程,使开发人员能够轻松地将快速检索与精确的语义排序相结合。
Rerankers thus serve as a semantic filter in RAG, compressing and distilling the document pool into a focused, high-precision context for generation. Many modern RAG implementations, including those in LangChain and LlamaIndex, now include reranking as a built-in or optional module. Vector databases like Qdrant, Weaviate, and Pinecone also support over-fetching and reranking workflows, allowing developers to easily combine fast retrieval with accurate semantic sorting.
在多模态随机抽样(RAG)系统中,检索保真度至关重要,它能确保检索到的上下文与输入查询(无论是文本、图像还是多种模态的组合)的相关性和一致性。虽然初始检索通常采用双编码器或双编码器以提高计算可扩展性,但此阶段产生的粗略相似度得分可能缺乏细粒度的语义对齐。这就需要中间重排序阶段,该阶段能够以更高的表达能力和精度评估候选文档。
In multimodal RAG systems, retrieval fidelity is critical to ensure the relevance and alignment of retrieved context with the input query, be it text, image, or a combination of modalities. While initial retrieval is often handled by bi-encoders or dual encoders for computational scalability, the coarse similarity scores produced at this stage may lack fine-grained semantic alignment. This introduces the need for an intermediate reranking stage, which evaluates candidate documents with greater expressiveness and precision.
最有效的重排序策略之一是使用交叉编码器。交叉编码器是一种联合编码查询和每个候选文档的模型,它能计算出更准确的相关性得分。与双向编码器(它独立计算查询和文档的嵌入,并使用余弦相似度或点积相似度进行比较)不同,交叉编码器在两个输入之间执行完整的词元级交互。这种设计支持丰富的交叉注意力机制和更深层次的语义推理,从而获得更高质量的排序结果。
One of the most effective reranking strategies involves the use of cross-encoders, which are models that jointly encode both the query and each candidate document to compute a more accurate relevance score. In contrast to bi-encoders, where embeddings for queries and documents are computed independently and compared using cosine or dot product similarity, cross-encoders perform full token-level interaction between the two inputs. This design allows for rich cross-attention mechanisms and deeper semantic reasoning, resulting in higher-quality rankings.
在多模态 RAG 环境中,查询或文档(或两者)可能包含文本和图像对,因此交叉编码器必须能够融合视觉和文本输入。这通常通过视觉语言模型( VLM ) 来实现,例如 CLIP、Bootstrapping Language-Image Pre-training ( BLIP )、Flamingo,或者更新的基于 Transformer 的架构,如 GIT、OFA 或 Qwen-VL。这些模型将图像和文本联合编码,使模型能够处理多模态输入。
In a multimodal RAG context, where either the query or documents (or both) may consist of text and image pairs, a cross-encoder must be capable of fusing visual and textual inputs. This is typically achieved through vision-language models (VLMs) such as CLIP, Bootstrapping Language-Image Pre-training (BLIP), Flamingo, or newer transformer-based architectures like GIT, OFA, or Qwen-VL. These models encode image and text jointly, enabling the model to reason over multimodal inputs.
对于重新排名,常见的流程包括:
For reranking, a common pipeline involves:
虽然像 ColBERT、ColPali 和 ColQwen 这样的后期交互模型也提供词元级评分,但它们保持查询词元和文档词元的独立编码,推迟精细化评分。与评分阶段相比,交叉编码器可以同时处理两个序列,并通过交叉注意力层实现全局的词元间交互。这使得交叉编码器更具表达力,但计算成本也更高,因为它们必须对每个查询-文档对进行单独编码。
While late interaction models like ColBERT, ColPali, and ColQwen also provide token-level scoring, they maintain independent encoding of query and document tokens, deferring fine-grained comparison to the scoring stage. In contrast, cross-encoders process both sequences simultaneously, enabling global token-to-token interactions via cross-attention layers. This makes cross-encoders more expressive but computationally expensive, as they must encode each query-document pair individually.
表 9.1比较了 RAG 系统中用于检索和重排序的三种常用架构:双编码器、后期交互模型和交叉编码器。每种方法在可扩展性和准确性之间都提供了不同的权衡,这取决于查询-文档对的编码和比较方式。双编码器通过独立编码输入来优先考虑速度和可扩展性,使其成为大规模第一阶段检索的理想选择。后期交互模型在编码后引入词元级比较,从而在性能和成本之间取得平衡。交叉编码器虽然计算量大,但通过联合编码并与两个输入进行深度交互,能够提供最高的准确性,使其成为小规模候选集上精确重排序的首选。
Table 9.1 compares three common architectures used for retrieval and reranking in RAG systems: bi-encoders, late interaction models, and cross-encoders. Each approach offers a different trade-off between scalability and accuracy, based on how query-document pairs are encoded and compared. Bi-encoders prioritize speed and scalability by independently encoding inputs, making them ideal for large-scale first-stage retrieval. Late interaction models introduce token-level comparisons post-encoding, striking a balance between performance and cost. Cross-encoders, though computationally intensive, deliver the highest accuracy by jointly encoding and deeply interacting with both inputs, making them the preferred choice for precision reranking over small candidate sets.
| Feature | Bi-encoder | Late interaction | Cross-encoder |
| --- | --- | --- | --- |
| Encoding | Independent | Independent | Joint |
| Interaction | None | Token-level (post-encode) | Full (within encoder) |
| Scalability | High | Moderate | Low |
| Accuracy | Moderate | High | Highest |
| Use case in RAG | First-stage retriever | Lightweight reranker | Precision reranker (small N) |
Table 9.1: Comparison of architectures for retrieval and reranking in RAG systems
In multimodal use cases, such as product search, medical imaging, visual question answering (VQA), and interactive assistants, a query might consist of a question paired with an image, or the system might retrieve relevant text from a document corpus using an image as the input. Cross-encoders play a vital role in these setups by ensuring that retrieved documents exhibit semantic and modality-aware alignment with the query. For example, when a user submits an image of a laptop with the query "Which model has HDMI and USB-C?", a cross-encoder can jointly attend to both the image and the product descriptions to rerank the most relevant matches.
Despite their accuracy, cross-encoders are computationally expensive, especially in multimodal scenarios where images require high-dimensional encoding and preprocessing. Several strategies are adopted to mitigate these costs:
In multimodal RAG systems, cross-encoder-based reranking acts as a high-precision filter, refining coarse retrieval outputs before they are passed to the language model for generation. By allowing full interaction between query and candidate tokens, including across image and text inputs, cross-encoders significantly enhance semantic matching. Although computationally heavier than other reranking approaches, their deployment on a small number of candidates makes them feasible and valuable for improving retrieval quality in real-world applications.
Several technology providers now offer hosted reranking solutions that can be easily integrated into search or RAG pipelines without the need for developing and maintaining in-house models. Among the most prominent is Cohere's Rerank API, a powerful transformer-based cross-encoder that takes a query along with a list of candidate documents and returns them reordered by semantic relevance, each with an associated confidence score. This model processes the query and each document jointly, enabling deep contextual understanding and precise matching. The latest versions of the service support long documents, multilingual capabilities, and various content types, including code and semi-structured data, while maintaining improved latency and efficiency compared to earlier releases.
Other cloud providers offer similar reranking capabilities. Microsoft's Azure Cognitive Search includes a semantic reranking feature that enhances the relevance of top-k results using transformer-based models from the Turing series. This semantic reranking can optionally generate highlights and explanations for the ranked results, making it suitable for enterprise search applications.
Amazon provides multiple reranking options through services like Amazon Kendra and Amazon Bedrock. Bedrock users can access hosted rerankers such as Cohere's API directly within the Amazon Web Services (AWS) ecosystem, enabling high-accuracy semantic reranking on top of existing vector or keyword search outputs.
Open-source ecosystems also support integration with hosted rerankers. For example, OpenSearch and Elasticsearch can be configured to use external APIs as second-stage rerankers. Some open-source tools, such as Answer.AI's reranker library, provide unified Python interfaces to a variety of reranking models, allowing developers to plug in alternatives like cross-encoders or late interaction models with minimal effort. These integrations make it feasible to upgrade standard search pipelines with sophisticated neural reranking models that significantly improve final result quality.
A cross-encoder, as explained in Figure 9.2 and Chapter 1, Introducing New Age Generative AI, is a neural model architecture commonly used in tasks requiring fine-grained interaction between a pair of inputs, most notably in semantic similarity, ranking, and QA. It is distinguished from bi-encoders by the fact that it processes both the query and the candidate (e.g., document) jointly, allowing token-level cross-attention throughout the entire transformer stack.
In the context of retrieval systems and RAG architectures, it is essential to distinguish between bi-encoders and cross-encoders, particularly regarding their embedding capabilities and indexing functionality.
A cross-encoder is a model architecture that jointly processes a pair of inputs, typically a query and a candidate [e.g., (query, document) or (query, image) pair]. Unlike bi-encoders, which generate standalone embeddings for queries and documents independently, cross-encoders do not produce reusable, indexable embeddings. Instead, they compute a single relevance score (e.g., a similarity logit) by encoding both inputs together through a shared transformer model. This score quantifies how well the query matches the candidate, but it does not result in a persistent vector representation of either input.
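To make the contrast concrete, the short sketch below (using the same cross-encoder checkpoint that appears later in this chapter) shows that a cross-encoder returns only a scalar relevance score per pair, never a storable vector:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# The (query, document) pair is encoded jointly and mapped to one number.
score = model.predict([("what causes rain?",
                        "Rain forms when atmospheric water vapor condenses.")])
print(score)  # a single relevance score, not an embedding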
As a result, cross-encoders are not suitable for indexing. They do not generate vector representations that can be stored in vector databases (e.g., Facebook AI Similarity Search (Faiss), Qdrant, ChromaDB) or used for nearest neighbor search. Instead, they are employed in reranking scenarios, where a small set of candidates (retrieved via bi-encoders or keyword search) is rescored for finer semantic accuracy.
To understand the practical differences between encoder architectures, it is useful to examine their support for indexable embeddings and how that impacts their role in retrieval workflows. While bi-encoders generate reusable vector representations suitable for large-scale search, cross-encoders operate directly on query-document pairs, enabling high-accuracy semantic reranking without producing standalone embeddings. This fundamental architectural difference is summarized in the following table:
| Encoder type | Indexable embeddings | Primary use |
| --- | --- | --- |
| Bi-encoder | Yes | Vector search and retrieval |
| Cross-encoder | No | Semantic reranking |
Table 9.2: Comparison of bi-encoder and cross-encoder architectures
So, cross-encoders are optimized for scoring, not storage. Their reliance on joint input encoding precludes them from producing detached query or document vectors. Therefore, in RAG systems, they serve a complementary role to bi-encoders by enhancing precision during the reranking stage, but not during the initial retrieval or indexing phases.
In RAG systems, multi-index embedding enables modular and modality-aware information retrieval by maintaining separate vector indexes for different data types such as text, images, or code. Each index is constructed using embeddings generated from modality-specific models, facilitating precise retrieval tailored to the nature of the query. This strategy is particularly effective in multimodal applications, allowing for flexible routing and hybrid retrieval from diverse sources. In contrast, cross-encoders do not generate indexable embeddings. Instead, they process a query and candidate pair jointly and output a single scalar relevance score. This score reflects semantic alignment but cannot be reused or stored for vector-based search. As a result, cross-encoders are exclusively applied in the reranking phase, where a small set of candidates retrieved via multi-index embeddings is re-evaluated for final selection. Together, these approaches offer a robust architecture: multi-index embeddings ensure breadth and modality coverage, while cross-encoders enhance semantic precision at the final step of the pipeline. So cross-encoders do not create multi-index embeddings. Let us understand what a multi-index embedding is.
Multi-index embedding refers to the construction and utilization of multiple vector indexes within RAG architectures. This approach enables systems to retrieve semantically relevant information from heterogeneous data sources, improving the precision, contextual alignment, and multimodal reasoning capabilities of the generative model. To build an understanding, refer to the following list:
Multi-index embedding introduces modularity and precision into RAG systems by allowing differentiated treatment of diverse content types. It supports hybrid and multimodal retrieval strategies, which are essential for developing robust AI systems capable of reasoning across varied data landscapes.
[CLS] Query tokens [SEP] Document tokens [SEP]
The transformer processes this sequence, and the output is typically taken from the [CLS] token, which aggregates the contextualized representation of the entire input.
At each transformer layer l, the representation is updated as:
H^l = TransformerLayer(H^{l−1})
where TransformerLayer applies multi-head self-attention across all tokens in Q and D together, allowing for full cross-interaction.
At the final layer L, a pooling strategy is used:
Often, the final output vector z ∈ ℝ^d is the representation of the [CLS] token, denoted z = H^L_0.
This vector is passed to a scoring head (e.g., a feed-forward layer followed by a sigmoid or softmax) to predict relevance:
Example: Relevance scoring
The final output score s between a query Q and document D may be computed as:
s = sigmoid(w^T z + b)
where w ∈ ℝ^d is a learned weight vector and b is a scalar bias.
In training, this score can be supervised using binary labels (relevant or not) with a loss function such as binary cross-entropy.
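Written out for a single pair with binary label y ∈ {0, 1} and predicted score s, the binary cross-entropy loss is:
L(s, y) = −[ y · log(s) + (1 − y) · log(1 − s) ]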
Cross-encoders form the second stage in many retrieval systems (like two-stage RAG), where a bi-encoder or vector search retrieves candidate documents and a cross-encoder refines the ranking based on semantic richness. Their computational cost is justified by their high fidelity in relevance modeling.
The following code implements a modular multimodal RAG pipeline to retrieve and generate laptop specifications based on image, text, or hybrid inputs. Leveraging CLIP-based image-text embeddings, ChromaDB for vector search, and an Ollama-based LLM for generation, the system offers multiple query modes: image-only, text-only, image + text, and generative answer completion.
This section outlines the full architecture and implementation details of a modular, multimodal assistant system built using a RAG pipeline. It begins with centralized configuration management in config.py and moves through CLIP-based embedding functions, data loaders, and ChromaDB-based index creation for both text and image content. It supports multiple retrieval modes and enables vector fusion for joint queries. To enhance precision, a cross-encoder-based reranker is applied before the final output. The system also integrates Ollama-based text generation and a Streamlit UI offering four interactive modes. Together, these components demonstrate a scalable and extensible RAG implementation for real-world multimodal search and question answering, with details as follows:
CHROMA_PERSIST_DIR = "chromadb_storage"
CHROMA_IMAGE_COLLECTION = "laptop_images"
CHROMA_TEXT_COLLECTION = "laptop_texts"
IMAGE_FOLDER = "data/images"
TEXT_FOLDER = "data/documents"
EMBED_MODEL_NAME = "clip"
MODEL_NAME = "llama3"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
Embedding is handled by:
def embed_text_ollama(text):
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    ...
    return outputs[0].tolist()

def embed_image_ollama(image_path):
    image = Image.open(image_path).convert("RGB")
    ...
    return outputs[0].tolist()
These functions produce vector representations used for retrieval in the ChromaDB vector store.
def load_text_documents(folder):
    ...
    return docs

def load_image_paths(folder):
    ...
    return [os.path.join(folder, f) ...]
text_collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)
image_collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)
Each item is embedded and added using:
text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])
This step is orchestrated via run_once.py:
from rag.index_builder import build_index
if __name__ == "__main__":
    build_index()
Our code performs multi-index vector creation: two separate indexes, one for text and one for images, within ChromaDB:
Note: You can create a single long vector by combining multiple embeddings (e.g., image + text), and our code already does this in the image + text → specs mode:
joint_vec = [(i + j) / 2 for i, j in zip(image_vec, text_vec)]
This is a simple average fusion of two same-length vectors.
Alternatively, concatenating the two embeddings preserves both in full:
joint_vec = image_vec + text_vec  # results in a longer vector (e.g., 1024 if each is 512)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
Given a query and candidate metadata, the reranker scores and ranks results based on semantic similarity:
def rerank(query, metadatas):
    pairs = [(query, doc.get("file", "")) for doc in metadatas]
    ...
    return [doc for doc, _ in ranked]
This improves the precision of top results returned by ChromaDB.
def get_llm():
    return Ollama(model=MODEL_NAME, temperature=0.2)
This is useful for generating human-readable specifications or summaries beyond simple retrieval.
if mode == "Text → Generated Answer":
    query = st.text_input("Ask something about laptops")
    if query:
        llm = get_llm()
        response = llm.invoke(query)
        st.text_area("LLM Response", response, height=300)
Each mode interacts with the respective ChromaDB collection and performs reranking to ensure the most relevant response is shown.
This modular, multimodal assistant system exemplifies a real-world implementation of a RAG pipeline. By cleanly separating configuration, embedding, retrieval, reranking, and generation, the system remains highly extensible and easily maintainable. Future enhancements may include document summarization, multilingual support, or a memory mechanism for chat-based interaction.
While the current implementation establishes a robust multimodal retrieval pipeline, it is important to recognize that it does not yet support generative outputs.
In the current state of the project, two key items are intentionally left incomplete to encourage hands-on practice and deeper understanding.
At this point, the rag/ folder does not yet include a fully functional generation.py. Your task is to create this module based on the intended functionality.
Note: This addition will allow your multimodal RAG assistant to not only retrieve specs but also generate fluent explanations or summaries of laptop features.
The file run_once.py, which builds your ChromaDB index from all available laptop images and specification documents, should be moved into the scripts/ folder (if not already) and run as:
python -m scripts.run_once
Once run_once.py is in place and generation.py is implemented, your full multimodal RAG system will be complete and production-ready.
Here are the complete setup instructions to get your multimodal RAG system with generation up and running from scratch:
1. Environment requirements: Ensure that you are using:
a. Python 3.9 or later
b. Pip or conda
c. Internet access (to download models)
2. Directory structure: Set up your folders as shown in the following figure:
3. Install dependencies: Create a virtual environment and install the required packages:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install streamlit torch torchvision transformers sentence-transformers chromadb langchain
4. Download pretrained models (Optional: First time only):
a. Your first run will download:
i. openai/clip-vit-base-patch32 for image/text embedding
ii. cross-encoder/ms-marco-MiniLM-L-6-v2 for reranking
Make sure you have a stable internet connection.
5. Prepare your data: Place your .txt spec documents and .jpg laptop images in:
data/documents/
data/images/
a. Ensure the text and image filenames correspond (e.g., dell_inspiron.jpg and dell_inspiron.txt).
b. Build index (Initial): Run once to create the ChromaDB collections:
python run_once.py
This embeds all text and images into Chroma and stores them persistently.
c. Launch the app: Start the Streamlit app:
streamlit run app.py
Access it in your browser at: http://localhost:8501/
d. Requirements.txt (Optional):
streamlit
torch
transformers
sentence-transformers
chromadb
langchain
Pillow
e. Then, run the following command:
pip install -r requirements.txt
In this chapter, you explored the role of reranking in enhancing information retrieval within multimodal RAG systems. By categorizing rerankers and focusing on the powerful cross-encoder approach, you learned how to improve the quality of results retrieved from both textual and visual data. You examined the architecture and logic behind cross-encoders in multimodal contexts and implemented a working reranker to refine image-text retrieval pipelines. To solidify your understanding, a set of practical to-dos challenged you to fill in missing code and structure. In the next chapter, we will explore various retrieval optimization techniques.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
Effective retrieval optimization is critical to building robust and responsive generative AI (GenAI) systems, particularly in multimodal and retrieval-augmented generation (RAG) scenarios. In practical deployments, merely embedding and retrieving data is insufficient; optimizing the retrieval pipeline significantly impacts the accuracy, efficiency, and relevance of generated responses.
In this chapter, we systematically explore key retrieval optimization techniques such as multi-index embedding, modality-based routing, and hybrid retrieval. We not only define each method conceptually but also provide clear, executable Python code examples that illustrate their implementation and practical utility. By applying techniques like query expansion, embedding normalization, and adaptive index refresh, readers will learn to enhance system recall, precision, adaptability, and other critical attributes in production-level GenAI systems.
The importance of this chapter lies in its detailed, hands-on approach to improving retrieval effectiveness, a foundational capability for any robust GenAI pipeline. Through optimization, retrieval components can significantly elevate a system's ability to provide contextually accurate, timely, and meaningful responses, thereby directly influencing the user experience and trustworthiness of AI outputs.
In this chapter, we will learn about the following topics:
The objective of this chapter is to equip readers with a comprehensive understanding of retrieval optimization techniques essential for building high-performance information retrieval systems. Focusing on strategies like modality-based routing, query expansion, hybrid retrieval, and cross-encoder reranking, the chapter aims to enhance both recall and precision in search tasks. Readers will learn how to implement these techniques through practical code examples, enabling them to build retrieval pipelines that are accurate, adaptive, and efficient. These skills are crucial for improving the foundational retrieval layer of modern AI systems, particularly in multimodal and RAG workflows.
We have already implemented reranking using cross-encoders and multi-index embedding. In this chapter, we now turn our attention to exploring additional retrieval optimization techniques that further improve relevance, efficiency, and multimodal adaptability.
At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.
These reranked documents, along with the original user query, are passed into the large language model (LLM) for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.
Retrieval systems in multimodal RAG face several critical drawbacks that limit their effectiveness in real-world applications. First, traditional retrieval pipelines often treat modalities independently, leading to suboptimal fusion of textual and visual information. They also rely heavily on static embeddings, which can fail to capture evolving user intent or contextual nuances. Cross-modal relevance scoring is another challenge, often resulting in irrelevant or mismatched outputs. Furthermore, latency increases significantly when dealing with large-scale, multimodal datasets.
Refer to the following list to understand the limitations that hinder both the accuracy and efficiency of multimodal RAG systems, necessitating more adaptive, intelligent, and unified retrieval mechanisms for future advancements:
The following table outlines common drawbacks encountered in information retrieval systems, mapping each limitation to its corresponding impact on retrieval performance. Understanding these challenges highlights the trade-offs and complexities involved in optimizing recall, precision, semantic comprehension, multimodal alignment, index freshness, ranking effectiveness, and contextual awareness in modern retrieval architectures.
| Drawback | Impact |
| --- | --- |
| Poor recall vs. precision trade-offs | Retrieval systems must compromise between completeness (recall) and accuracy (precision), making it challenging to optimize both simultaneously. This may cause the retrieval of irrelevant results or missing relevant documents. |
| Limited semantic understanding | Systems fail to grasp deep meaning or intent, leading to missed relevant documents when phrases differ or context is nuanced, causing omissions or irrelevant hits, especially in semantic or multimodal retrieval. |
| Modality mismatch in multimodal retrieval | Incorrect alignment between different data types (e.g., images and text) results in retrieval failures, such as returning wrong-modality items or missing relevant cross-modal results, reducing system effectiveness. |
| Index staleness (outdated index) | Outdated indexes omit recent information, include obsolete content, and degrade both recall and precision, making retrieval results less accurate and less timely. |
| Ranking inefficiencies | Fast but approximate ranking may bury relevant documents and elevate irrelevant ones, reducing the effectiveness of result ordering unless slower, more complex reranking is applied at added cost and latency. |
| Lack of contextual awareness | Results can be contextually misleading or unhelpful because retrieval treats queries and documents in isolation, losing broader user intent and narrative context, especially with fragmented or multimodal data. |
Table 10.1: Mapping retrieval system limitations to their practical impacts on search quality and user experience
To mitigate the preceding limitations, modern retrieval systems employ a range of optimization techniques. The following methods enhance recall, precision, and relevance across textual, multimodal, and RAG scenarios by addressing specific drawbacks.
Instead of representing each document with a single vector or in a single index, multi-index embedding or multi-vector representation techniques use multiple embeddings per item or multiple indexes specialized by content. One common approach is multi-vector indexing, where a long document is segmented into multiple parts, each indexed by its own embedding. This ensures that different topical aspects of a document are captured, improving the chances that at least one segment will match a relevant query. The result is higher recall and finer semantic matching for complex or lengthy documents; the system no longer misses information just because it was buried in a long text. Moreover, considering multiple vectors per document can improve precision by giving a more nuanced representation; each vector covers a specific context, so irrelevant parts of a document are less likely to cause false matches. In practice, multi-index embedding improves semantic coverage and context retention by capturing different facets of content, and it enhances retrieval accuracy and understanding. For example, a technical paper might have separate embeddings for its abstract, methods, and conclusion. A question about the paper's method will directly hit the method embedding segment, rather than relying on a single vector that might dilute this detail. In RAG pipelines, multi-vector schemes similarly allow long knowledge articles to be queried effectively without losing pertinent details, thereby addressing poor recall on long documents and mitigating the loss of context within those documents.
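A minimal sketch of this multi-vector idea follows, splitting one document into sections and indexing each section under its own embedding; the section names, model, and collection name are assumptions made for illustration:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection(name="doc_segments")

paper = {
    "id": "paper_42",
    "sections": {
        "abstract": "We study reranking for multimodal retrieval...",
        "methods": "We train a cross-encoder on query-document pairs...",
        "conclusion": "Reranking improves precision at low latency cost...",
    },
}

# One embedding per section: a query about the method can hit the
# methods vector directly instead of one diluted whole-document vector.
for name, text in paper["sections"].items():
    collection.add(
        ids=[f'{paper["id"]}_{name}'],
        embeddings=[model.encode(text).tolist()],
        documents=[text],
        metadatas=[{"doc": paper["id"], "section": name}],
    )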
To tackle modality mismatch, retrieval architectures introduce modality-based routing, which means that queries are directed to modality-specific indexes or models. Rather than forcing all data types into one homogeneous representation, the system maintains separate pipelines optimized for text, images, audio, etc., and then combines the outputs. For example, a multimodal search engine might have one vector index for text passages and another for image embeddings; if a query contains both an image and text, it routes each part to the appropriate index. This way, each modality is handled with the best-suited retrieval method, e.g., Contrastive Language-Image Pretraining (CLIP) embeddings for images, Bidirectional Encoder Representations from Transformers (BERT) based embeddings for text, without one modality's noise confusing the other. By isolating modalities, the system avoids direct comparisons of incomparable features and thus reduces cross-modal error. In practice, one can query multiple indexes in parallel and then perform a late fusion of results, ensuring that the top results from each modality are considered. If a user asks a question with an image example attached, the image can be used to fetch similar images while the text query retrieves relevant documents; the results can then be merged. Another strategy is joint embedding spaces (a form of routing at the model level): using models like CLIP, which learn a shared vector space for text and images so that an image query and a caption can be directly compared. This aligns modalities to speak a common language of vectors, greatly alleviating the modality mismatch problem. Modality-based routing (whether via separate indices or joint embeddings) ensures that each data type's unique characteristics are respected, thereby improving precision and recall in multimodal retrieval by addressing the cross-modal alignment challenge.
Query expansion is a classic technique to improve recall and bridge lexical-semantic gaps in textual retrieval. The idea is to expand the user's query with additional terms or phrases that have similar meaning, including synonyms, related concepts, or alternate formulations. By automatically broadening the query, the system retrieves documents it might otherwise miss due to wording differences. For instance, a query on global warming effects could be expanded with terms like climate change impacts so that documents using either term are considered. This directly addresses the poor recall aspect of the recall-precision trade-off: expansion increases the number of relevant results found (at some cost to precision). In practice, modern systems use thesauri, language models, or even LLMs to generate expansions. In the context of RAG, query expansion can feed the retriever multiple reformulations of a question, yielding a richer set of context passages for the generator. While this may introduce a few more irrelevant hits (since the query is broader), it significantly reduces the chance of missing pertinent information hidden behind different terminology. Smart expansion strategies (e.g., only adding highly relevant synonyms, or using feedback from initial results to expand further) help maintain precision while boosting recall. By covering more semantic ground, query expansion mitigates the limited semantic understanding of strict keyword search and improves the system's ability to find relevant data despite vocabulary mismatches.
Embedding normalization is a low-level but crucial optimization in vector-based retrieval. It addresses an often-overlooked issue: vector embeddings can vary in length (magnitude), which can skew similarity computations. For example, if one document's embedding has a larger norm than another's, it might score higher on a dot product similarity with a query even if the direction (semantic content) is less aligned. Normalization (typically L2 normalization to unit length) ensures that all vectors lie on the same hypersphere, so that similarity is determined purely by angle (cosine similarity) rather than vector length. This improves the semantic fidelity of retrieval: documents are retrieved for being truly similar in content, not just because their embedding has a larger magnitude. Normalized embeddings also bring numerical stability and consistency: maximizing inner product becomes equivalent to maximizing cosine similarity, making the retrieval metric well-behaved and comparable across queries. In practice, many embedding models already output normalized vectors or have an option to do so; if not, vector databases often allow flagging that data should be treated as normalized. By preventing any single vector from dominating due to length anomalies, normalization yields a more reliable ranking of results (addressing a subtle ranking inefficiency). It is especially important in multimodal settings or when merging results from different models, as their embedding scales might differ. Ensuring a uniform scale removes one source of error, letting the retrieval focus on true semantic similarity. In summary, embedding normalization fine-tunes the retrieval engine's mathematical underpinning to enhance precision and consistency in results.
Hybrid retrieval combines the strengths of keyword (lexical) search and vector (semantic) search to overcome each method's weaknesses. Rather than relying on one approach, a hybrid system performs both a lexical match (e.g., BM25 or TF-IDF index) and a semantic similarity search (via embeddings) and then merges the results. This technique directly confronts the limited semantic understanding of pure keyword search and the complementary issue that pure semantic search can miss exact or rare terms. By using both, the system can balance precise term matching with broader semantic coverage. For example, consider a technical query containing a specific error code and a general problem description: the BM25 component will ensure documents containing that exact error code are retrieved, while the embedding component will fetch documents about the general problem even if they phrase it differently. Rank fusion or reranking of the combined candidate list then yields a final ranking that is more comprehensive and relevant than either method alone. Modern RAG pipelines frequently use this approach; first, gather a set of top-N passages by lexical search and top-M by vector search, then deduplicate and rerank them together. The result is significantly improved recall and precision, as evidenced by Anthropic's example, where using both methods returns more applicable chunks for generation. Hybrid retrieval also mitigates context loss: lexical matches can provide the exact contextual identifiers (like names or numbers) that an embedding might overlook, anchoring the semantic results in concrete details. Overall, this optimization addresses recall/precision trade-offs by effectively combining two scoring signals, yielding a retrieval that is both accurate and semantically aware.
Hybrid retrieval systems merge the outputs of keyword-based (lexical) search and semantic (vector) search, but a major technical challenge lies in combining their fundamentally different scoring schemes into a unified ranking. Lexical models like BM25 produce scores based on term frequency and document statistics, while vector search provides similarity measures, often cosine or Euclidean distance, which are not directly comparable or even on the same numeric scale.
To address this, score normalization techniques are applied before merging results. The normalization process transforms scores from each method into a common scale (often min-max scaled values or standardized z-scores), allowing fair combination and fusion. Typical strategies include:
For example, in a typical hybrid pipeline, top-N results from BM25 and top-M from embedding search are first selected. Their scores are then normalized, duplicate hits are merged (often keeping the best score per method), and the final list is reranked using the fused (combined or weighted) scores. This process ensures that precise keyword matches (e.g., for IDs or rare terms) are not overshadowed by semantically similar but less precise content, and vice versa.
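A compact sketch of this normalize, merge, and fuse step is shown below; the raw scores, the min-max normalization, and the equal 0.5/0.5 weights are illustrative choices rather than fixed prescriptions:

def min_max(scores):
    # Rescale a {doc_id: score} map to the [0, 1] range.
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0
            for k, v in scores.items()}

bm25_scores = {"doc1": 12.4, "doc2": 9.1, "doc3": 7.7}      # lexical (BM25)
vector_scores = {"doc2": 0.83, "doc3": 0.79, "doc4": 0.41}  # semantic (cosine)

bm25_n, vec_n = min_max(bm25_scores), min_max(vector_scores)

# Merge duplicates across the two result lists and fuse with equal weights.
fused = {doc: 0.5 * bm25_n.get(doc, 0.0) + 0.5 * vec_n.get(doc, 0.0)
         for doc in set(bm25_n) | set(vec_n)}

ranking = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)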
Score normalization is essential for hybrid retrieval to avoid one modality dominating due to numerical scale differences, ultimately enabling the system to leverage the strengths of both lexical precision and semantic breadth for the best possible retrieval performance.
We have already built an understanding of reranking with a cross-encoder. However, let us understand it more thoroughly. To tackle the ranking inefficiency of first-stage retrieval, systems often employ a reranking step with cross-encoders (or other powerful rerankers). A cross-encoder is a transformer model that takes a query and a candidate document together as input and produces a relevance score, effectively performing a deep semantic comparison with full context. This is far more accurate than the independent encoding used in bi-encoder models (where query and document are embedded separately). The drawback, of course, is that doing this for every possible document is infeasible; however, doing it for a small set of top candidates (say, top 50 or 100 from the initial retriever) is usually manageable. The strategy, therefore, is to use a fast retriever to get a candidate pool, then apply a cross-encoder to rerank those candidates with high precision. This two-stage approach addresses the earlier trade-off by combining speed and accuracy. The cross-encoder corrects the mistakes of the first stage. For instance, it can notice that a top-ranked passage only superficially matches the query and is not relevant, demoting it below a truly relevant passage that the initial stage may have ranked lower. Empirically, adding a cross-encoder reranker significantly boosts metrics like Mean Reciprocal Rank (MRR) or precision@k, as it filters out false positives and reorders results based on a richer understanding of query context and document content. In a RAG system, improved reranking means the LLM gets more relevant grounding passages, directly improving answer quality. The cost is extra computation, but optimizations exist (e.g., using smaller cross-encoders or only reranking a subset). Overall, cross-encoder reranking is a targeted fix for ranking inefficiency, as it injects a high-context, high-precision judgment just where it is needed, at the final ranking of top candidates, to ensure the results are as relevant and contextually appropriate as possible.
Prefiltering thresholds are a practical optimization used in multi-stage retrieval systems to reduce the computational latency of reranking with cross-encoders. Since evaluating a cross-encoder on every candidate document is prohibitively slow, prefiltering thresholds help ensure that only the most promising candidates (i.e., those likely to be relevant) are passed on for costly reranking. Here is how this works and why it is effective:
The importance of prefiltering thresholds is as follows:
The following is an example of how this can be implemented:
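It is a minimal sketch, assuming each candidate already carries a first-stage similarity score; the 0.45 threshold, the fallback rule, and the model checkpoint are illustrative assumptions:

from sentence_transformers import CrossEncoder

PREFILTER_THRESHOLD = 0.45  # illustrative cut-off on first-stage similarity

def rerank_with_prefilter(query, candidates, reranker):
    # candidates: list of (doc_text, first_stage_score) pairs.
    survivors = [doc for doc, s in candidates if s >= PREFILTER_THRESHOLD]
    if not survivors:  # best-effort fallback: keep the single top candidate
        survivors = [max(candidates, key=lambda c: c[1])[0]]
    scores = reranker.predict([(query, doc) for doc in survivors])
    return sorted(zip(survivors, scores), key=lambda p: p[1], reverse=True)

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [("Guide to GPU kernels and memory.", 0.72),
              ("Collection of soup recipes.", 0.12)]
print(rerank_with_prefilter("how do GPU kernels work?", candidates, reranker))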
The following are the benefits:
Prefiltering thresholds act as a smart filter between fast retrieval and slow, accurate reranking by cross-encoders, ensuring only the most promising documents are reranked. This approach enables you to enjoy the high precision of a cross-encoder, without incurring prohibitive inference cost or latency for every candidate, by reducing the reranking workload to a manageable, high-likelihood subset.
Finally, to combat index staleness and embedding drift, retrieval systems implement adaptive index refresh policies as shown in Figure 10.1. This means the index is not a once-and-done static structure but is updated on a schedule or in response to changes. One aspect is incremental indexing: as new documents arrive or existing ones change, they are added to (or reindexed in) the search index continuously or periodically, rather than waiting for a complete reindexing. This keeps the content fresh and ensures recall of up-to-date information. In practice, production systems have an index update pipeline that feeds new data and uses background processing to keep the vector store current. Another aspect is adapting to changes in the embedding model itself. If the system's vector encoder is retrained or replaced (for example, a newer language model is deployed), the stored embeddings may no longer be compatible or optimal. Adaptive refresh entails re-embedding the corpus when models are updated or when significant drift is detected. Monitoring can be used to decide when re-embedding is necessary (e.g., if similarity scores start degrading or recall@k drops). By re-computing embeddings on the latest model and swapping them into the index, the system maintains alignment between query embeddings and document embeddings, preventing the relevance mismatches that arise from embedding drift. In sum, adaptive index refresh addresses the staleness drawback by ensuring the retrieval index remains a living reflection of both the data and the model's understanding. This results in more accurate and timely retrieval: new knowledge is searchable, and the similarity comparisons remain valid over time. Techniques in this vein include scheduled reindexing, real-time indexing for streaming data, and hybrid approaches where recent data is searched live (with fallback to slower search) if not yet indexed. Together, these practices guarantee that the retrieval system's knowledge stays current and its vector space stays consistent, thus upholding retrieval performance in the face of evolving content and models.
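As a schematic sketch of such a refresh policy, the following snippet upserts new or changed documents incrementally into a persistent ChromaDB collection; the collection name, model, and scheduling are assumptions, and a production system would drive this from a scheduler plus drift monitoring:

import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="chromadb_storage")
collection = client.get_or_create_collection(name="kb_texts")
model = SentenceTransformer("all-MiniLM-L6-v2")

def refresh_index(changed_docs):
    # Incremental indexing: upsert only new or changed documents
    # instead of rebuilding the whole index from scratch.
    for doc_id, text in changed_docs.items():
        collection.upsert(
            ids=[doc_id],
            embeddings=[model.encode(text).tolist()],
            documents=[text],
        )

# Call periodically (e.g., from a cron job). If the embedding model itself
# is replaced, re-embed and upsert the full corpus so that query vectors
# and stored document vectors remain aligned.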
So, modern retrieval systems, whether pure document search, multimodal search, or RAG, are far from static keyword matchers. They are complex, evolving systems that must overcome fundamental limitations in recall, precision, semantic understanding, cross-modal alignment, and context handling. By applying the above optimization techniques, such systems markedly improve in robustness and relevance: multi-vector representations enrich what is indexed, modality-specific handling aligns disparate data types, query expansion broadens the search horizon, embedding normalization and hybrid search refine the matching process, rerankers inject intelligent ordering, and continuous index refresh keeps the system up-to-date. Each technique targets specific drawbacks, and together they enable retrieval pipelines to provide high-quality, context-aware results in increasingly diverse and demanding applications. The interplay of these methods exemplifies how conceptual innovation (rather than just mathematical complexity) can drive substantial improvements in information retrieval performance and reliability.
In high-performance retrieval systems, especially those supporting multimodal inputs and RAG, optimizing the retrieval process is essential for achieving high precision, recall, and contextual relevance. This section details the implementation of core retrieval optimization strategies using Python and Qdrant, with embeddings generated via Sentence Transformers. Each technique is motivated by a real-world challenge and substantiated with modular, reusable code, as follows:
def route_query(query: str, modality: str = "text") -> str:
    routing_table = {
        "text": "text_index",
        "image": "image_index",
        "multimodal": "hybrid_index"
    }
    return routing_table.get(modality, "text_index")
This function ensures that each query is processed by the most relevant sub-index, avoiding modality mismatch and improving retrieval precision.
def query_expansion(query: str) -> list:
    synonym_dict = {
        "climate": ["environment", "weather"],
        "car": ["vehicle", "automobile"]
    }
    words = query.split()
    expanded = set(words)
    for word in words:
        if word in synonym_dict:
            expanded.update(synonym_dict[word])
    return list(expanded)
By expanding climate to include environment and weather, the retrieval system is more likely to return conceptually relevant documents that use alternate terminology.
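A quick illustrative call (output order varies between runs because a set is used internally):

print(query_expansion("climate policy"))
# Possible output: ['climate', 'policy', 'environment', 'weather']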
import numpy as np

def normalize_embedding(embedding: np.ndarray) -> np.ndarray:
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm != 0 else embedding
This function guarantees that all embeddings lie on a unit hypersphere, ensuring semantic similarity is judged by angular distance alone, thus improving scoring reliability across indexes.
def weighted_embedding_fusion(text_emb: np.ndarray, image_emb: np.ndarray, text_weight: float = 0.6) -> np.ndarray:
    fused = text_weight * text_emb + (1 - text_weight) * image_emb
    return normalize_embedding(fused)
This fusion technique allows biasing towards more reliable modalities (e.g., text in legal documents, image in e-commerce), and ensures the resulting vector is still normalized for similarity search.
def score_fusion(results_a: list, results_b: list, method: str = "reciprocal_rank") -> list:
    def reciprocal_rank(score, rank):
        return 1 / (rank + 1)
    fused_scores = {}
    for rank, item in enumerate(results_a):
        fused_scores[item.id] = fused_scores.get(item.id, 0) + reciprocal_rank(item.score, rank)
    for rank, item in enumerate(results_b):
        fused_scores[item.id] = fused_scores.get(item.id, 0) + reciprocal_rank(item.score, rank)
    merged = [{"id": k, "fused_score": v} for k, v in fused_scores.items()]
    return sorted(merged, key=lambda x: x["fused_score"], reverse=True)
This technique mitigates modality bias by ensuring that results highly ranked in either list are promoted fairly in the merged output.
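For instance, assuming each result exposes id and score attributes (a tiny stand-in class is used here for illustration):

from dataclasses import dataclass

@dataclass
class Hit:
    id: str
    score: float

text_hits = [Hit("d1", 0.92), Hit("d2", 0.85)]
image_hits = [Hit("d2", 0.88), Hit("d3", 0.80)]
# "d2" is ranked in both lists, so its reciprocal-rank contributions
# accumulate and it rises to the top of the fused ranking.
print(score_fusion(text_hits, image_hits))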
from qdrant_client.http.models import Filter, FieldCondition, MatchValue

def filter_by_metadata(source: str = None, year: int = None) -> Filter:
    conditions = []
    if source:
        conditions.append(FieldCondition(key="source", match=MatchValue(value=source)))
    if year:
        conditions.append(FieldCondition(key="year", match=MatchValue(value=year)))
    return Filter(must=conditions)
Qdrant allows filtering at query time via such metadata. This function can be used to prioritize documents from reliable sources or within a relevant timeframe.
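A usage sketch follows; the client setup, collection name, metadata values, and the placeholder query embedding are illustrative stand-ins, not values from the book:

import numpy as np
from qdrant_client import QdrantClient

qdrant_client = QdrantClient(":memory:")  # stand-in for your configured client
query_embedding = np.random.rand(384)     # placeholder for a real query embedding
hits = qdrant_client.search(
    collection_name="documents",
    query_vector=query_embedding.tolist(),
    query_filter=filter_by_metadata(source="pubmed", year=2024),
    limit=10,
)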
from qdrant_client.http.models import VectorParams, Distance, PointStruct

def refresh_index(collection_name: str, data: list, encoder, vector_size: int, qdrant_client):
    qdrant_client.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
    )
    points = []
    for item in data:
        text = item.get("text") or item.get("desc")
        vector = normalize_embedding(encoder.encode(text))
        points.append(PointStruct(id=item["id"], vector=vector.tolist(), payload=item["metadata"]))
    qdrant_client.upsert(collection_name=collection_name, points=points)
This function allows periodic or event-driven reindexing, ensuring alignment between stored data, metadata, and evolving models.
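Building on this, a lightweight drift check can decide when a refresh is warranted. The sketch below is illustrative only: it assumes a fixed set of probe queries and an arbitrary similarity threshold, and reuses normalize_embedding from earlier:

def should_reembed(qdrant_client, collection_name, probe_queries, encoder, threshold=0.45):
    # Average the top-1 similarity over a fixed probe set; a drop below the
    # threshold suggests embedding drift and can trigger refresh_index().
    sims = []
    for q in probe_queries:
        vec = normalize_embedding(encoder.encode(q))
        hits = qdrant_client.search(collection_name=collection_name, query_vector=vec.tolist(), limit=1)
        if hits:
            sims.append(hits[0].score)
    return bool(sims) and (sum(sims) / len(sims)) < threshold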
To effectively harness the potential of genetic algorithms for retrieval optimization, it is essential to translate theoretical concepts into practical, reproducible code. The following section demonstrates how to simulate the evolutionary process in a retrieval context by defining a fitness function that evaluates configuration performance, and by establishing mechanisms to represent, initialize, and manipulate candidate solutions. Through this approach, we can iteratively refine retrieval pipelines, allowing for automated discovery of superior parameter combinations. This hands-on implementation lays the groundwork for scalable, data-driven optimization, reducing manual intervention and enabling rapid experimentation in complex search environments.
import random
import numpy as np

# Sample fitness function (you'd replace this with actual retrieval evaluation)
def evaluate_config(text_weight, use_query_expansion) -> float:
    # Placeholder: simulate a fitness score based on hyperparameters
    score = 0.7 * text_weight + (0.2 if use_query_expansion else 0)
    noise = np.random.uniform(-0.05, 0.05)
    return score + noise

# Encode individuals as [text_weight, query_expansion_flag]
def initialize_population(size=10):
    return [[random.uniform(0.3, 0.9), random.choice([0, 1])] for _ in range(size)]

def mutate(individual):
    if random.random() < 0.5:
        individual[0] = min(1.0, max(0.0, individual[0] + random.uniform(-0.1, 0.1)))
    else:
        individual[1] = 1 - individual[1]  # toggle query expansion
    return individual

def crossover(p1, p2):
    return [(p1[0] + p2[0]) / 2, random.choice([p1[1], p2[1]])]

def select(pop, scores, k=4):
    return [pop[i] for i in np.argsort(scores)[-k:]]

def genetic_optimization(generations=20, pop_size=10):
    population = initialize_population(pop_size)
    best_config = None
    best_score = -np.inf
    for gen in range(generations):
        scores = [evaluate_config(*ind) for ind in population]
        gen_best = max(scores)
        if gen_best > best_score:
            best_score = gen_best
            # Record the best individual from the population that was
            # actually scored, before the population is replaced below.
            best_config = population[int(np.argmax(scores))]
        top_individuals = select(population, scores)
        new_population = top_individuals[:]
        while len(new_population) < pop_size:
            p1, p2 = random.sample(top_individuals, 2)
            child = mutate(crossover(p1, p2))
            new_population.append(child)
        population = new_population
        print(f"Gen {gen+1}: Best Score = {gen_best:.4f}")
    print("\nOptimal Parameters Found:")
    print(f"Text Weight: {best_config[0]:.2f}, Query Expansion: {'On' if best_config[1] else 'Off'}")
    return best_config
This GA-based retrieval optimization method addresses the challenge of parameter interaction across retrieval stages (modality fusion, query reformulation, contextual scoring). Unlike gradient-based methods, GAs do not require a differentiable loss and can navigate discrete and continuous search spaces simultaneously. In our implementation, each individual encodes a fusion weight (text_weight) and a binary query-expansion flag, and the fitness function scores the retrieval quality of that configuration.
By integrating GAs into the retrieval pipeline, systems can self-optimize over time, adapting to domain-specific needs (e.g., placing more emphasis on image embeddings in fashion search vs. textual metadata in legal corpora).
Setting up a multimodal RAG system from scratch involves orchestrating multiple components to enable intelligent QA across both text and image data. This guide provides a step-by-step walkthrough for building a fully functional multimodal RAG pipeline, integrating CLIP-based embedding, ChromaDB for vector storage, and LangChain for response generation. A key feature of this system is adaptive index refresh, which ensures the retrieval index remains up-to-date with evolving content or embedding models. Whether you are starting with raw files or adding new data dynamically, this setup equips your system for scalable, context-aware, and accurate multimodal search and generation.
Follow the setup instructions given in Chapter 9, Building GenAI Systems with Reranking, with minor changes.
The following figure illustrates the architecture of a multimodal RAG system that supports both text and image inputs. A user submits a query, which is encoded using either a text or image embedding model, depending on modality. Documents and images are preprocessed and chunked into embeddings that are stored in a vector database. The index is periodically refreshed to maintain alignment with updated content. During retrieval, the query is matched against stored embeddings, and the top results are passed to an LLM, which generates a contextually informed response that is returned to the user as the final output, as explained in the following steps.
Figure 10.1: Multimodal RAG system with adaptive index refresh
1. Refresh the index (anytime or on demand): To refresh all indexes (e.g., new files added, or model updated):
python run_refresh.py
Or use the Refresh Indexes button in the app UI.
2. Directory structure: Set up your folder as shown in the following figure:
The following list is the complete end-to-end code for your multimodal RAG system with adaptive index refresh, organized into clean, modular .py files:
#### rag/config.py
CHROMA_PERSIST_DIR = "chromadb_storage"
CHROMA_IMAGE_COLLECTION = "laptop_images"
CHROMA_TEXT_COLLECTION = "laptop_texts"
IMAGE_FOLDER = "data/images"
TEXT_FOLDER = "data/documents"
EMBED_MODEL_NAME = "clip"
MODEL_NAME = "llama3"  # For Ollama LLM
#### rag/embedding_utils.py
from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text_ollama(text):
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = clip_model.get_text_features(**inputs)
    return outputs[0].tolist()

def embed_image_ollama(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model.get_image_features(**inputs)
    return outputs[0].tolist()
#### rag/loaders.py
import os

def load_text_documents(folder):
    docs = {}
    for file in os.listdir(folder):
        if file.endswith(".txt"):
            with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
                docs[file] = f.read()
    return docs

def load_image_paths(folder):
    return [os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
#### rag/index_builder.py
import os
import chromadb
from .embedding_utils import embed_text_ollama, embed_image_ollama
from .config import *
from .loaders import load_text_documents, load_image_paths

def build_index():
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    # Text collection
    if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
        client.delete_collection(name=CHROMA_TEXT_COLLECTION)
    text_collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)
    texts = load_text_documents(TEXT_FOLDER)
    for idx, (fname, content) in enumerate(texts.items()):
        emb = embed_text_ollama(content)
        text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
    # Image collection
    if CHROMA_IMAGE_COLLECTION in [c.name for c in client.list_collections()]:
        client.delete_collection(name=CHROMA_IMAGE_COLLECTION)
    image_collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)
    images = load_image_paths(IMAGE_FOLDER)
    for idx, path in enumerate(images):
        emb = embed_image_ollama(path)
        image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])
#### rag/refresh.py
import os
import chromadb
from .embedding_utils import embed_text_ollama, embed_image_ollama
from .config import *
from .loaders import load_text_documents, load_image_paths

def refresh_text_index():
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
        client.delete_collection(CHROMA_TEXT_COLLECTION)
    collection = client.create_collection(name=CHROMA_TEXT_COLLECTION)
    texts = load_text_documents(TEXT_FOLDER)
    for idx, (fname, content) in enumerate(texts.items()):
        emb = embed_text_ollama(content)
        collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
    print(f"Text index refreshed with {len(texts)} documents.")

def refresh_image_index():
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    if CHROMA_IMAGE_COLLECTION in [c.name for c in client.list_collections()]:
        client.delete_collection(CHROMA_IMAGE_COLLECTION)
    collection = client.create_collection(name=CHROMA_IMAGE_COLLECTION)
    images = load_image_paths(IMAGE_FOLDER)
    for idx, path in enumerate(images):
        emb = embed_image_ollama(path)
        collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])
    print(f"Image index refreshed with {len(images)} images.")

def refresh_all_indexes():
    refresh_text_index()
    refresh_image_index()
#### rag/reranker.py
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, metadatas):
    pairs = [(query, doc.get("file", "")) for doc in metadatas]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(metadatas, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
#### rag/generation.py
from langchain_community.llms import Ollama
from .config import MODEL_NAME

def get_llm():
    return Ollama(model=MODEL_NAME, temperature=0.2)
#### run_once.py
from rag.index_builder import build_index

if __name__ == "__main__":
    build_index()
The following are the different kinds of scripts:
#### run_refresh.py
from rag.refresh import refresh_all_indexes

if __name__ == "__main__":
    refresh_all_indexes()
Note: You may create your own UI; the following is a sample.
#### app.py
import streamlit as st
import os
import chromadb
from rag.embedding_utils import embed_text_ollama, embed_image_ollama
from rag.reranker import rerank
from rag.config import *
from rag.generation import get_llm
from rag.refresh import refresh_all_indexes

client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
st.title("Multimodal RAG Laptop Assistant")
mode = st.radio("Choose Mode", ["Image → Specs", "Image + Text → Specs", "Text → Image + Specs", "Text → Generated Answer"])
if st.button("Refresh Indexes"):
    refresh_all_indexes()
    st.success("Indexes refreshed successfully!")
# ... (same content as the Chapter 8 and 9 `app.py` query handling logic here) ...
This setup integrates adaptive index refresh into your existing multimodal RAG pipeline.
This comprehensive approach enhances the capabilities of your multimodal RAG pipeline by integrating cutting-edge methods for both representation and retrieval. By combining adaptive index refresh, multi-vector embeddings, and a unified vector database, the system is able to handle a wide range of input modalities and query types with efficiency and precision. Through careful orchestration of text and image embeddings, as well as sophisticated reranking techniques, the architecture serves as a robust foundation for building advanced AI assistants. The following sections further detail the underlying pipeline, storage architecture, and retrieval mechanisms that power this system.
The end-to-end code can be found in the GitHub repository of the book. Please refer to the multi-vector representation concept listed in Chapter 6, Two and Multi-stage GenAI Systems. This system integrates dense and multi-vector text embeddings along with image embeddings into a unified vector database using Qdrant. It supports multimodal retrieval and token-level reranking and leverages an adaptive embedding refresh mechanism to ensure data consistency. The architecture exemplifies a practical implementation of a hybrid RAG pipeline with late interaction, multimodal context, and local LLM reasoning.
The following figure illustrates a robust RAG pipeline designed to efficiently process and respond to user queries using both text and image inputs. By leveraging dedicated embedding models for each modality and storing them in a unified vector database, the system supports hybrid semantic search and retrieval. Periodic index refreshes ensure that newly ingested documents and images are reflected in the database. Retrieved results undergo multi-vector-based reranking before being passed to an LLM for final answer generation, enabling accurate and context-aware multimodal responses.
The system generates and stores three types of vector representations for each paired text and image document: a dense text embedding for fast semantic preselection, a multi-vector (token-level) text embedding for late-interaction reranking, and an image embedding for visual similarity.
This design results in a unified vector store that supports multimodal retrieval (text and image) and multi-vector reranking (token-level precision).
The retrieval process is divided into two stages to balance speed and accuracy: a fast dense-vector preselection followed by token-level late-interaction reranking, as the following fragment shows:
prefetch = models.Prefetch(query=dense_query, using="dense_text")
# The prefetched candidates are then reranked with the token-level vectors
# via Qdrant's Query API (enclosing call reconstructed for illustration):
results = client.query_points(
    collection_name=collection_name,
    prefetch=prefetch,
    query=colbert_query,
    using="colbert_text",
)
Once the top-ranked documents are selected, their textual content is extracted and concatenated to form a context string. This context is passed, along with the original query, to a local LLM (Mistral via Ollama) using a ReAct-style prompt in LangChain:
response = chain.run({"query": query_text, "context": context})
The LLM synthesizes the context and returns a natural language response.
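For reference, one way such a chain could be assembled with LangChain is sketched below. The prompt template here is a simplified stand-in for the ReAct-style prompt described above, and the model name mirrors the Mistral-via-Ollama setup; query_text and context come from the surrounding pipeline:

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import Ollama

prompt = PromptTemplate(
    input_variables=["query", "context"],
    template=(
        "Answer the question using ONLY the information in the context.\n"
        "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ),
)
chain = LLMChain(llm=Ollama(model="mistral", temperature=0.2), prompt=prompt)
response = chain.run({"query": query_text, "context": context})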
The system includes an adaptive refresh function that scans a specified text and images folder. It detects valid .txt and .jpg file pairs, generates all necessary embeddings, and upserts them into Qdrant.
This process is adaptive in the following ways: it picks up newly added or modified file pairs on each scan, regenerates every required embedding with the currently loaded models, and upserts the results so the index tracks both data and model changes.
This refresh mechanism ensures that Qdrant stays up-to-date with the latest dataset, making it suitable for environments where documents change regularly (e.g., weekly updates).
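A minimal sketch of the pair-detection step (the helper name and directory layout are assumptions for illustration):

import os

def find_document_pairs(text_dir, image_dir):
    # Pair files that share a base name, e.g., laptop1.txt with laptop1.jpg.
    names = {os.path.splitext(f)[0] for f in os.listdir(text_dir) if f.endswith(".txt")}
    pairs = []
    for name in sorted(names):
        image_path = os.path.join(image_dir, name + ".jpg")
        if os.path.exists(image_path):
            pairs.append((os.path.join(text_dir, name + ".txt"), image_path))
    return pairs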
The collection is configured with a different indexing strategy per vector type: the dense vectors are indexed for fast approximate preselection, while the token-level (multi-vector) embeddings are reserved for exact late-interaction scoring during reranking, as sketched below.
This configuration is optimal for a two-stage retrieval system that relies on fast preselection and accurate late interaction reranking.
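A hedged configuration sketch with qdrant-client follows; the vector names match the prefetch snippet above, while the sizes and collection name are illustrative assumptions:

from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # illustrative; use a persistent instance in practice
client.create_collection(
    collection_name="docs",
    vectors_config={
        # Dense vectors: HNSW-indexed for fast approximate preselection.
        "dense_text": models.VectorParams(size=384, distance=models.Distance.COSINE),
        "image": models.VectorParams(size=512, distance=models.Distance.COSINE),
        # Token-level vectors: MAX_SIM late interaction; HNSW disabled (m=0)
        # because these are only scored during reranking.
        "colbert_text": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            hnsw_config=models.HnswConfigDiff(m=0),
        ),
    },
)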
After completing the adaptive index refresh integration, you are encouraged to extend this system by implementing additional retrieval optimization techniques. Start by incorporating query expansion to improve recall using synonyms or paraphrasing. Add modality-based routing to dynamically direct queries to the appropriate index based on input type. Implement embedding normalization before similarity comparisons and experiment with weighted embedding fusion to balance multimodal inputs. Integrate score fusion and aggregation for combining results from multiple sources. Finally, enhance contextual filtering using metadata such as timestamps or reliability. These additions will significantly improve the relevance, robustness, and adaptability of your system.
This chapter provided a comprehensive overview of retrieval optimization techniques, addressing the fundamental drawbacks of traditional and modern retrieval systems. We explored how targeted strategies, such as modality-based routing, query expansion, score fusion, and adaptive index refresh, mitigate these limitations. Through detailed design principles and modular Python implementations, we demonstrated how to implement adaptive index refresh. A fully functional codebase featuring ChromaDB, CLIP embeddings, and a Streamlit interface was presented, culminating in an adaptive indexing pipeline. Readers are now equipped with both conceptual understanding and practical tools to extend this framework with additional optimization techniques for real-world applications. In the next chapter, we will implement multimodal GenAI systems with voice as input.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
This chapter explores adding speech as a primary input mode to multimodal generative AI (GenAI) systems. Traditionally reliant on text or visual input, such systems are increasingly embracing voice to enhance accessibility, natural interaction, and user engagement. The illustrated pipeline introduces a seamless flow where user queries—via keyboard or voice—are routed through a retrieval-augmented generation (RAG) chatbot. Voice input undergoes a speech-to-text (STT) transformation before integration. The system then checks a vector database for relevant context. If found, the context is passed to a Mistral large language model (LLM) for answer generation. If not, the pipeline dynamically falls back on a web search to provide sufficient grounding for response synthesis. Finally, generated answers are optionally converted to speech, closing the multimodal loop with a voice-based output. This architecture highlights the growing sophistication of GenAI interfaces, unifying speech, text, retrieval, and generation into a robust, user-centric interaction model.
In this chapter, we will learn about the following topics:
The objective of this chapter is to design and implement a voice-enabled multimodal RAG system that integrates speech, document retrieval, web search fallback, and local LLM-based response generation. By enabling both STT and text-to-speech (TTS) capabilities, the system aims to create a more natural, accessible, and context-aware conversational interface. The solution combines modular LangChain components, LangGraph orchestration, Ollama-hosted LLMs, and a Streamlit UI to deliver grounded responses from local Portable Document Format (PDF) files or the web. This chapter demonstrates how speech can serve as a primary input modality in advanced RAG architectures, enhancing usability across real-world, multimodal applications.
RAG has emerged as a powerful paradigm for grounding LLMs in external knowledge. While early implementations of RAG systems predominantly operated in the textual domain and subsequently evolved to incorporate visual modalities such as images, recent advancements call for an expansion toward a broader spectrum of modalities. A truly multimodal RAG system integrates diverse data types, including but not limited to audio (speech), video, sensor data, tabular inputs, and structured knowledge graphs, enabling richer and more contextually grounded generation across domains.
In this broader multimodal RAG framework, queries can originate from various input channels: speech (converted to text), gesture (interpreted via pose estimation), spatial data (via light detection and ranging (LiDAR) or Internet of Things (IoT) sensors), or user interactions in real-time environments (e.g., augmented reality and virtual reality (AR/VR) settings). The retrieval mechanism must therefore operate over heterogeneous index structures—embedding databases, graph databases, or structured warehouses, each representing a different modality-specific embedding space. This requires either modality-aware retrievers or cross-modal alignment techniques to ensure semantically coherent retrieval.
The generation module, typically powered by a foundation model (e.g., Mistral, GPT, or Gemini), then integrates these retrieved contexts, potentially across modalities, using fusion techniques such as attention-weighted late fusion, embedding concatenation, or contextual scoring. Such architectures enable applications ranging from multimodal conversational agents and intelligent tutoring systems to autonomous agents in physical environments. Thus, expanding RAG beyond image-text fusion unlocks new frontiers for grounding LLMs in complex, real-world information ecosystems.
At query time, the user query is encoded into a vector and compared against the stored document embeddings using approximate nearest neighbor (ANN) search, retrieving the top-k most similar candidates. These vector search results are then forwarded to a cross-encoder reranker, which jointly processes the original query and each candidate document to compute fine-grained similarity scores via full token-level interaction. The reranker reorders the results based on semantic relevance, producing a more accurate set of top-k reranked documents.
These reranked documents, along with the original user query, are passed into the LLM for synthesis. The LLM generates the final answer, which is returned to the user. This two-stage design balances scalability (via bi-encoder retrieval) with precision (via cross-encoder reranking), resulting in both efficient and high-quality response generation.
STT and TTS technologies serve as foundational components in the development of voice-enabled multimodal AI systems. By enabling natural language interaction through spoken input and auditory output, these technologies significantly enhance accessibility, hands-free operation, and user engagement, especially in environments where visual or tactile input may be constrained.
Let us understand the core components that power voice-enabled conversational systems: STT and TTS. These technologies form the bidirectional bridge between human speech and machine intelligence, enabling natural, intuitive interactions. STT transcribes spoken input into machine-readable text, acting as the auditory gateway to downstream AI models. TTS, in turn, gives voice to the system's responses, synthesizing human-like speech from generated text. Together, they enable seamless, end-to-end voice interaction in modern conversational AI pipelines.
Together, STT and TTS form a closed feedback loop, converting user speech into actionable machine input and delivering synthesized voice output, thereby completing the auditory interface cycle in conversational AI.
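As a minimal round-trip sketch using the two libraries adopted later in this chapter (SpeechRecognition for STT and pyttsx3 for TTS; microphone access requires a working audio backend):

import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    print("Listening...")
    audio = recognizer.listen(source)          # capture microphone input
text = recognizer.recognize_google(audio)      # STT: audio -> text

engine = pyttsx3.init()                        # TTS: text -> audio
engine.say(f"You said {text}")
engine.runAndWait()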
The integration of STT and TTS technologies into RAG pipelines extends the capabilities of generative systems beyond traditional text-based interfaces, enabling more natural and multimodal human-computer interaction. This voice augmentation is particularly impactful in applications such as virtual assistants, accessibility systems, and embodied AI agents operating in real-world environments.
In a voice-enabled RAG pipeline, STT modules serve as the entry point, transcribing spoken user input into structured text. This transcribed text is then routed through the core RAG pipeline, where it is used to perform semantic retrieval against a vector database or other knowledge source. Retrieved documents are concatenated with the input query and passed to an LLM, such as Mistral, GPT, or Llama, which generates a contextually grounded response.
Following response generation, TTS systems are employed at the output stage to synthesize natural speech from the textual output of the LLM. This completes the voice-based interaction loop, delivering conversational responses in a human-auditory format.
This bidirectional speech integration not only enhances user experience but also introduces challenges in latency, streaming inference, and real-time error correction. Addressing these requires careful orchestration of asynchronous input/output (I/O), fast STT/TTS inference engines, and fallback mechanisms for low-confidence speech recognition or generation outputs.
The following figure illustrates a multimodal voice-enabled RAG pipeline designed to handle both keyboard and voice inputs in an intelligent question answering (QA) system:
Figure 11.1: Voice-enabled multimodal RAG pipeline integrating speech
The process begins with user input, which can be entered either through a keyboard or captured via voice. For spoken queries, the system first performs STT conversion, transcribing the spoken words into textual form. Regardless of input modality, the question routing module ensures that all inputs are normalized and sent downstream in a unified format.
Next, the query is handled by the RAG chatbot, which performs a vector database lookup to check whether relevant contextual knowledge is already embedded and retrievable. If context is found, it is passed directly to the Mistral LLM, which uses this context to generate a grounded response. If no relevant context is located, the system defaults to web search-based retrieval, ensuring the LLM still receives sufficient grounding information.
The generated answer, produced by the Mistral LLM, can then optionally be converted back into audio using TTS synthesis, providing a spoken output that aligns with the original input modality. This closed-loop pipeline exemplifies how modern multimodal systems can integrate retrieval, generation, and speech technologies to deliver intuitive, accessible conversational AI experiences.
Figure 11.2 presents the high-level project directory structure for a voice-enabled multimodal RAG chatbot system. The architecture modularizes key functionalities such as language modeling, vector retrieval, prompt engineering, and voice processing. This organization supports extensibility and clear separation of concerns, ranging from document ingestion and embedding to real-time speech interaction and frontend deployment.
The system leverages a carefully curated technology stack designed to support modular, local-first, and speech-enabled RAG workflows, as outlined in the following table:
| Component | Description |
| --- | --- |
| LangChain | Serves as the backbone for RAG, enabling prompt templating, document loading, and LLM orchestration in a modular pipeline. |
| LangGraph | Provides a graph-based execution model to manage conditional flows (e.g., fallback to web search) and dynamic routing of query paths. Ideal for managing complex query states. |
| Ollama | Hosts local LLMs such as Mistral or Llama, enabling fast, offline inference without external application programming interface (API) calls. Supports custom model integration and GPU acceleration. |
| Streamlit | Powers the web-based frontend UI, enabling users to interact with the chatbot via a clean, reactive interface. Supports real-time voice and text inputs. |
| Tavily API | Acts as a live web search fallback when no relevant context is found in the local vector database, ensuring responses remain grounded in up-to-date external knowledge. |
| Nomic Embeddings | Used to convert ingested documents into high-dimensional vector representations suitable for similarity search in the vector database. |
| pyttsx3 | Enables TTS conversion on the client side, generating audible responses from LLM outputs in a fully offline, platform-agnostic manner. |
| SpeechRecognition | Captures and transcribes voice input into text using local microphone streams, acting as the system's STT engine. |
Table 11.1: Key components of the voice-enabled multimodal RAG chatbot
This integrated stack supports an end-to-end multimodal conversational AI pipeline that is capable of local inference, dynamic retrieval, real-time speech interaction, and fallback augmentation.
The system features a minimalist Streamlit-based frontend that enables users to interact with the multimodal RAG chatbot using either keyboard input or real-time voice queries. The interface displays transcribed speech, dynamically retrieves relevant context, and presents grounded answers with source attribution.
The frontend interface, as shown in the following figure, is a multimodal RAG chatbot that is implemented using Streamlit, providing users with a web-based user interface (UI) to interact with the system via either keyboard or voice. The script (app.py) integrates various backend modules and coordinates user input, retrieval, generation, and speech functionalities in a modular, real-time workflow.
To understand the core execution flow of the voice-enabled multimodal RAG chatbot, consider the Streamlit-based application, which integrates local LLM inference, speech processing, and dynamic document retrieval. From environment setup and module imports to real-time interaction via keyboard or microphone, the pipeline orchestrates LLM invocation, graph-based reasoning, and speech synthesis to deliver a seamless user experience. Here, we break down the major steps that enable multimodal interaction in this system:
1. Environment setup and imports: The script begins by resolving the module path and importing dependencies:
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
from rag.ollama_llm import get_llm
from rag.graph_workflow import graph
from rag.voice import listen_from_microphone, speak_text
This setup ensures access to internal RAG modules and encapsulated logic for LLM invocation (get_llm()), graph-based reasoning (graph.invoke()), and speech interaction.
2. Page initialization: The Streamlit page is initialized with a title and layout:
st.set_page_config(page_title="RAG 语音聊天机器人", layout="wide")
st.set_page_config(page_title="RAG Chatbot with Voice", layout="wide")
st.title("多模态 RAG 聊天机器人(PDF + Web + 语音)")
st.title("Multimodal RAG Chatbot (PDF + Web + Voice)")
3. Input mode selection: A radio button allows users to choose between Keyboard and Voice input:
input_mode = st.radio("选择输入法:", ["键盘", "语音"], horizontal=True)
input_mode = st.radio("Choose input method:", ["Keyboard", "Voice"], horizontal=True)
4. Keyboard interaction flow: For text input, the query is submitted via a st.text_input() field and processed upon clicking the "Ask" button:
query = st.text_input("输入您的问题:")
query = st.text_input("Type your question:")
如果 query.strip() 且 st.button("询问"):
if query.strip() and st.button("Ask"):
with st.spinner("思考中..."):
with st.spinner("Thinking..."):
状态 = graph.invoke({...})
state = graph.invoke({...})
The graph.invoke() function controls the RAG pipeline, retrieving documents and web content if necessary. The prompt is constructed dynamically using retrieved context:
prompt = f"""{prefix}
You are a helpful assistant. Use ONLY the information in the CONTEXT below...
"""
The response is generated by the local Ollama-hosted LLM and returned to the user via st.markdown(...), while also being converted to speech:
response = llm.invoke([HumanMessage(content=prompt)])
speak_text(final_answer)
5. Voice interaction flow: In the "Voice" mode, clicking "Speak your question" triggers real-time microphone capture:
query = listen_from_microphone()
The captured voice is transcribed via the SpeechRecognition module. The transcribed query follows the same logic path as text: passing through retrieval, prompt assembly, generation, and final response rendering. Again, speak_text(final_answer) ensures audio output:
st.success(f"您说的是:{query}")
st.success(f"You said: {query}")
final_answer = response.content.strip()
final_answer = response.content.strip()
speak_text(final_answer)
speak_text(final_answer)
Exception handling is incorporated to report runtime errors:
except Exception as e:
    st.error(f"Voice error: {e}")
This frontend orchestrates a complete multimodal loop, accepting voice/text, performing retrieval with LangGraph, invoking LLMs via LangChain, and returning responses both visually and auditorily. The modularity and clarity of design make it well-suited for real-time RAG interactions with multimodal capabilities.
To understand the internal workings of the multimodal RAG chatbot, it is essential to explore how the system is modularized across its core Python components. Each script in the rag/ directory is responsible for a specific function, ranging from document ingestion and vector indexing to prompt construction, query routing, and LLM inference. The following explanation builds an understanding of these modules in a logical execution order, highlighting how they collaborate to enable end-to-end RAG with voice and web search capabilities. The system begins by loading PDF documents (loaders.py) and transforming them into vector embeddings (embeddings.py), which are stored in a vector database (vectorstore.py). When a user query arrives, the system first tries to retrieve relevant documents locally. If no context is found, it queries the web using Tavily (tavily_search.py). The router.py module decides between these two sources. Context is then formatted into a structured prompt (prompts.py) and passed to a local LLM using Ollama (ollama_llm.py). The execution flow is managed by graph_workflow.py using LangGraph, while utils.py supports formatting and preprocessing throughout the pipeline.
This section delineates the foundational components and operational logic underlying the voice-enabled multimodal RAG framework. It systematically explores the document preprocessing pipeline, embedding strategies, vector indexing mechanisms, LLM-based query routing, and graph-driven control flow, thereby illustrating a cohesive architecture for grounded, speech-integrated information retrieval.
The PDF loading and chunking function load_pdfs() is as follows:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

def load_pdfs(folder_path):
    documents = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            loader = PyPDFLoader(os.path.join(folder_path, filename))
            docs = loader.load()
            documents.extend(splitter.split_documents(docs))
    return documents
from langchain_nomic.embeddings import NomicEmbeddings
This import statement loads the NomicEmbeddings wrapper, which provides a standard LangChain-compatible interface to the nomic-embed-text-v1.5 model, a high-performance embedding model optimized for semantic search and document retrieval tasks.
def get_embeddings():
    return NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")
By encapsulating the embedding logic in get_embeddings(), the system ensures a plug-and-play architecture, facilitating easy model replacement or configuration changes without modifying downstream code.
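A minimal sketch of how this factory could be parameterized to make such a swap explicit (the parameter name and default are illustrative, not part of the original module):

def get_embeddings(model_name: str = "nomic-embed-text-v1.5"):
    # pass a different model name to swap embeddings without touching callers
    return NomicEmbeddings(model=model_name, inference_mode="local")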
from langchain_community.vectorstores import SKLearnVectorStore
This import brings in the SKLearnVectorStore implementation from LangChain’s community module. It is particularly well-suited for local, prototyping environments where persistent or large-scale vector storage (e.g., Qdrant, Faiss) is not required.
def create_vectorstore(docs, embedding):
    return SKLearnVectorStore.from_documents(docs, embedding)
The from_documents() method performs two operations: it embeds each document chunk using the supplied embedding model, and it stores the resulting vectors in an in-memory index for similarity search.
Due to its in-memory nature, SKLearnVectorStore is ideal for development and testing but may not scale for production use where persistent or distributed indexing (e.g., via Qdrant or Pinecone) is needed.
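As a quick usage sketch (the query string is illustrative), a store built this way supports direct similarity search:

vectorstore = create_vectorstore(docs, get_embeddings())
hits = vectorstore.similarity_search("What is a text splitter?", k=3)
for hit in hits:
    print(hit.page_content[:120])  # preview the top-matching chunks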
import os
import requests
from langchain.schema import Document
from dotenv import load_dotenv
This function imports environment variables, network request tools, and the Document schema from LangChain to maintain compatibility with the rest of the RAG architecture.
load_dotenv()

def search_tavily(query, max_results=3):  # the default for max_results is illustrative
    api_key = os.getenv("TAVILY_API_KEY")
    url = "https://api.tavily.com/search"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"query": query, "num_results": max_results}
Robust exception handling ensures graceful degradation in case of network errors or unexpected responses:
    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        data = response.json()
        results = data.get("results", [])
        return [Document(page_content=entry["content"]) for entry in results if "content" in entry]
    except Exception:
        return [Document(page_content="Tavily returned no results.")]
This enables live knowledge augmentation, especially critical for time-sensitive, fact-based queries not covered by static document corpora.
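A brief usage sketch, assuming TAVILY_API_KEY is present in the .env file (the query is illustrative):

web_docs = search_tavily("latest LangChain release", max_results=3)
for doc in web_docs:
    print(doc.page_content[:200])  # preview each returned snippet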
from langchain.schema import HumanMessage, SystemMessage
from rag.ollama_llm import get_llm
from rag.prompts import router_instructions
from rag.utils import safe_json_parse
This setup imports LangChain-compatible message schemas, an LLM interface, pre-defined routing instructions as system prompts, and a utility function for robust JSON parsing.
# Assumed setup: an LLM instance configured to emit JSON (e.g., ChatOllama(..., format="json"))
llm_json_mode = get_llm()

def route_question_and_get_source(question: str) -> str:
    messages = [
        SystemMessage(content=router_instructions),
        HumanMessage(content=question)
    ]
    response = llm_json_mode.invoke(messages)
The function sends a two-turn conversation to the LLM: a system message carrying the routing instructions and a human message carrying the user's question.
This approach leverages LLM-driven control flow, where the model returns a structured JSON output indicating the preferred data source.
    try:
        result = safe_json_parse(response.content)
        datasource = result.get("datasource", "vectorstore").lower()
        return "web" if datasource == "websearch" else "pdf"
In the event of a failure (e.g., parsing error or unexpected content), the system defaults to using the local document vectorstore:
    except Exception as e:
        print(f"[ROUTER] JSON parsing failed: {e}")
        return "pdf"
This design introduces dynamic adaptability in response sourcing, ensuring higher response relevance without manual rule-based routing.
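For example, a hedged usage sketch (sample questions only; the actual routing decision depends on the LLM's judgment):

print(route_question_and_get_source("How does LangChain split documents?"))   # likely "pdf"
print(route_question_and_get_source("What is the weather in Paris today?"))   # likely "web"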
router_instructions = """
您就像一台路由器,决定是使用用户的私有 PDF 文件还是通过网络搜索来回答问题。
如果问题与 LangChain、提示工程或其他文档中涵盖的主题有关 → 请使用“vectorstore”。
如果问题与近期事件、天气、人物、地点或现实世界数据有关 → 请使用“网络搜索”。
仅返回类似这样的 JSON:
{ "数据源": "网络搜索" }
或者
{ "数据源": "向量存储" }
"""
rag_prompt = """
你是一位得力的助手。请仅使用以下上下文中的信息来回答问题。
如果上下文中没有包含答案,请直接回答:
“根据上下文,我无法判断。”
---
语境:
{语境}
---
问题:
{问题}
---
指示:
- 请勿使用先前的知识。
- 请勿编造答案。
- 仅使用上述上下文中的信息。
请用2-3句简洁的句子回答。
- 如果不确定,就说“根据上下文,我不知道”。
---
回答:
"""
这些提示共同确保了智能路由和接地生成,这是可信的 RAG 系统的两个基本组成部分。
router_instructions = """
You are a router deciding whether a question should be answered using the user's private PDFs or from a web search.
If the question is about LangChain, prompt engineering, or other topics covered in the provided documents → use 'vectorstore'.
If the question is about recent events, weather, people, locations, or real-world data → use 'websearch'.
Return ONLY a JSON like:
{ "datasource": "websearch" }
or
{ "datasource": "vectorstore" }
"""
rag_prompt = """
You are a helpful assistant. Use ONLY the information in the CONTEXT below to answer the QUESTION.
If the CONTEXT does not contain the answer, respond exactly with:
"I don’t know based on the context."
---
CONTEXT:
{context}
---
QUESTION:
{question}
---
INSTRUCTIONS:
- Do NOT use prior knowledge.
- Do NOT make up any answers.
- ONLY use information in the context above.
- Answer in 2–3 concise sentences.
- Say “I don’t know based on the context” if unsure.
---
Answer:
"""
Together, these prompts ensure both intelligent routing and grounded generation, two foundational components of trustworthy RAG systems.
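To illustrate how the generation prompt is assembled at runtime, a small sketch with placeholder values (the context and question strings are illustrative):

filled_prompt = rag_prompt.format(
    context="LangChain provides loaders, splitters, and retrievers.",
    question="What does LangChain provide?"
)
print(filled_prompt)  # the grounded prompt that is sent to the local LLM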
from langchain_community.chat_models import ChatOllama
The import references the community-supported ChatOllama integration, which connects LangChain with models served via the Ollama runtime, a popular framework for running lightweight LLMs locally on CPU or GPU.
def get_llm():
return ChatOllama(model="mistral")
By wrapping the model instantiation inside get_llm(), the design follows dependency injection best practices, allowing easy substitution of models (e.g., switching from Mistral to Llama 2) without changing the core logic.
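A minimal sketch of that substitution, assuming the alternative model has already been pulled into the local Ollama runtime (the parameter and model names are illustrative):

def get_llm(model_name: str = "mistral"):
    # any locally available Ollama model can be injected here
    return ChatOllama(model=model_name)

llm = get_llm("llama2")  # swap models without changing downstream code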
llm = get_llm()
docs = load_pdfs("data/documents")
retriever = create_vectorstore(docs, get_embeddings()).as_retriever(search_kwargs={"k": min(3, len(docs))})
import operator
from typing import Annotated, List, TypedDict
from langchain.schema import Document

class GraphState(TypedDict):
    question: str
    generation: str
    web_search: str
    max_retries: int
    answers: int
    loop_step: Annotated[int, operator.add]
    documents: List[Document]
The GraphState class defines the state variables passed between graph nodes; the following functions implement the nodes themselves:
retrieve(state)
def retrieve(state):
    return {
        "documents": retriever.invoke(state["question"]),
        "web_search": "No"
    }
web_search(state)
def web_search(state):
    docs = search_tavily(state["question"])
    return {
        "documents": docs,
        "web_search": "Yes"
    }
generate(state)
def generate(state):
# summarize if from web
# format context into rag_prompt
# call LLM to get answer
Return "generate" if state["web_search"] == "No" else "websearch"
from langgraph.graph import StateGraph, END

def build_graph():
    workflow = StateGraph(GraphState)
    workflow.add_node("retrieve", retrieve)
    workflow.add_node("generate", generate)
    workflow.add_node("websearch", web_search)
    # grade_documents is defined elsewhere in graph_workflow.py
    workflow.add_node("grade_documents", grade_documents)
These calls define the nodes and how data flows between them. The set_conditional_entry_point() method allows the graph to choose its starting node dynamically (retrieve or websearch) based on LLM routing, as sketched below.
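A hedged sketch of that entry-point wiring (the lambda adapter and the mapping are illustrative):

    # the router inspects the question and returns "pdf" or "web"
    workflow.set_conditional_entry_point(
        lambda state: route_question_and_get_source(state["question"]),
        {"pdf": "retrieve", "web": "websearch"}
    )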
    workflow.add_conditional_edges("generate", grade_generation_v_documents_and_question, {
        "useful": END,
        "not useful": "websearch",
        "not supported": "generate",
        "max retries": END
    })
    return workflow.compile()

graph = build_graph()
The compiled graph ensures dynamic, context-aware, and adaptable execution, supporting retries, fallbacks, and summarization through a structured, extensible pipeline.
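An invocation sketch with illustrative initial state values:

result = graph.invoke({
    "question": "What is retrieval-augmented generation?",
    "max_retries": 3,
    "loop_step": 0,
})
print(result["generation"])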
Robust JSON parsing utility: The safe_json_parse() function is a fault-tolerant utility designed to extract and parse JSON-formatted content from LLM-generated text. LLMs, even when prompted to return structured data, can sometimes produce additional natural language output or malformed JSON. This utility ensures that downstream components receive clean, machine-readable JSON objects, thereby maintaining reliability in automated workflows such as query routing. The details are as follows:
import json
import re
def safe_json_parse(text):
    try:
        # grab the first {...} span, tolerating surrounding natural-language text
        match = re.search(r'{.*?}', text.strip(), re.DOTALL)
        if match:
            return json.loads(match.group(0))
        else:
            raise ValueError("No JSON found in LLM output")
    except Exception as e:
        raise ValueError(f"JSON parsing failed: {e}\nRaw Text:\n{text}")
By acting as a defensive programming layer, safe_json_parse() mitigates the risk of downstream crashes due to malformed or noisy LLM responses, enabling a more reliable and production-ready pipeline.
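For example, a quick sketch of the utility coping with chatty LLM output (the raw string is illustrative):

raw_output = 'Sure! Here is my decision: {"datasource": "websearch"} Hope that helps.'
print(safe_json_parse(raw_output))  # {'datasource': 'websearch'}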
from rag.loaders import load_pdfs
from rag.embeddings import get_embeddings
from rag.vectorstore import create_vectorstore
These imports modularly encapsulate document loading (load_pdfs()), embedding initialization (get_embeddings()), and vector index creation (create_vectorstore()).
def main():
    print("📚 Loading documents from data/documents...")
    docs = load_pdfs("data/documents")
    print(f"Loaded {len(docs)} documents.")
    print("🧠 Creating vectorstore...")
    vectorstore = create_vectorstore(docs, get_embeddings())
    print(f"Embedded {len(docs)} documents into vectorstore.")

if __name__ == "__main__":
    main()
Run this script:
python run_once.py
To populate your vectorstore before deploying the frontend or invoking the LangGraph workflow.
This system began with a simple premise: to augment RAG with voice as a primary input modality, making interaction more natural, accessible, and user-centric. Through careful modular design, the project evolved into a robust multimodal AI assistant, capable of ingesting local documents, intelligently routing queries, retrieving context via vector search or web fallback, and generating grounded, reliable responses using a locally hosted LLM.
We started by enabling STT input and TTS output, seamlessly integrating human voice into the RAG feedback loop. We then introduced a graph-based orchestration layer using LangGraph, allowing conditional flows such as summarizing web content, retrying queries, and gracefully handling document coverage gaps.
Each Python module was purpose-built: loaders.py for document ingestion, embeddings.py for vectorization, router.py for LLM-based source routing, and graph_workflow.py for state-driven control. A minimalist yet effective frontend was built using Streamlit, allowing users to interact via voice or text with a consistent backend execution flow.
This system not only showcases the potential of voice-enabled RAG architectures but also provides a foundation for further extension into image, video, or real-time multimodal applications. In doing so, it bridges the gap between human communication and grounded AI reasoning: efficiently, ethically, and interactively. The end-to-end code is available in code.zip under Chapter_11.
This chapter explored the evolution of RAG systems beyond traditional image and text inputs, emphasizing the integration of voice as a core modality. We examined the conceptual and architectural foundations of a voice-enabled multimodal RAG pipeline, detailing how STT and TTS interfaces can enhance natural interaction. The system dynamically routes queries between local vector search and web-based retrieval, ensuring grounded, context-aware responses. We also dissected the full implementation—from document ingestion to LangGraph-based orchestration and frontend deployment, demonstrating how modular code design supports real-time, speech-driven AI experiences. Together, these components illustrate how voice augments RAG systems for richer, more accessible applications.
The following chapter delves into reasoning and reranking techniques, offering insights into their roles in enhancing response quality within RAG systems.
As generative AI (GenAI) continues to evolve, the ability to simply retrieve and generate content is no longer enough. Truly intelligent systems must be able to reason, interpret diverse modalities like text and images, and select the best response from many possible outputs. This chapter pushes the boundaries of multimodal GenAI by introducing you to chain-of-thought (CoT) prompting combined with reranking, enabling your models to think step-by-step and choose wisely.
In this chapter, you will explore how to architect systems where models do not just respond but rather deliberate. You will learn to guide models through explicit reasoning steps, integrating context from both retrieved documents and image-based information, and then apply multi-pass reranking to refine answers based on quality, relevance, or task-specific constraints.
Through hands-on implementation using LangChain, Ollama, and custom CoT templates, you will build unified multimodal flows where text and image signals converge to support robust decision-making. Topics include few-shot CoT strategies, dynamic prompt construction, and context-aware reranking, all of which culminate in the development of powerful, reasoning-augmented multimodal applications.
By the end of this chapter, you will have constructed a sophisticated GenAI system capable of performing visual question answering (QA), multimodal document analysis, and step-by-step contextual decision-making, paving the way for next-generation AI that can reason as well as it retrieves.
In this chapter, we will learn about the following topics: the role of reasoning in GenAI systems; the major types of reasoning (deductive, inductive, abductive, analogical, commonsense, causal, mathematical, spatial, temporal, tool-based, and multimodal); and techniques such as CoT prompting and ReAct agents.
This chapter is theoretical in nature, so that you can understand the core concepts of reasoning in GenAI systems. It explores why reasoning is essential for building intelligent, reliable, and explainable AI models. We examine various types of reasoning, including deductive, inductive, abductive, analogical, commonsense, causal, mathematical, spatial, temporal, tool-based, and multimodal reasoning, and explain how each contributes to improved performance and decision-making. You will also learn how modern techniques like CoT prompting and reasoning and acting (ReAct) agents enable models to reason step-by-step. This foundation will prepare you to design and implement more capable and context-aware AI systems in later chapters.
GenAI has rapidly evolved from generating text, code, or images to supporting complex decision-making tasks. At the core of this evolution lies the integration of reasoning capabilities, the ability of a model to not just generate outputs, but to understand, plan, and explain them. As the landscape of GenAI applications expands into multimodal domains and high-stakes environments, reasoning becomes the differentiating factor that transforms a reactive model into a reliable, intelligent system.
Most traditional GenAI models rely on surface-level pattern recognition. Given a prompt, they generate a response based on statistical likelihood. While this is effective for simple tasks (e.g., drafting an email, generating a poem), it often falls short in scenarios where multi-step logic, planning, or justification is required.
Reasoning fills this gap by enabling models to think out loud, evaluating intermediate steps, simulating decisions, and justifying outcomes. This is essential when working with complex, multi-hop queries (e.g., which department has the highest budget among those where employees earn over $90,000?). Without reasoning, the model may guess or skip steps; with reasoning, it can break down the query, identify subtasks, and solve them sequentially.
A key challenge in GenAI adoption is trust. In business, law, medicine, and education, users demand systems that are not only correct but also explain their decisions. Reasoning improves explainability by exposing the intermediate steps that lead to a conclusion.
For example, in legal document analysis, a GenAI model should not only summarize a contract clause but explain why a clause is considered risky, step-by-step. This level of accountability is only possible through reasoning.
In real-world language and vision tasks, ambiguity is common. The same term may refer to different things based on context (Apple as a company vs. a fruit; a name that could belong to either an employee or a department table). Reasoning enables the model to use surrounding context to resolve such ambiguity.
In multimodal GenAI, this becomes even more critical. For example, if a model is answering a question about a chart or an image, it must combine visual cues with textual intent and use logic to infer what the user likely means.
The true power of GenAI lies in its ability to handle multimodal inputs: text, images, documents, code, tables, and even audio. However, these modalities come with diverse structures and semantics. Reasoning is essential for aligning these formats, linking evidence across them, and inferring connections that no single modality makes explicit.
A model that simply embeds and retrieves multimodal data cannot go far without the capacity to reason across formats and infer missing links.
Advanced prompting strategies like CoT and ReAct are direct implementations of reasoning in GenAI. These prompts encourage the model to externalize its intermediate steps before committing to an answer.
For instance, when converting a natural language query into SQL, a CoT-enabled model can first reason about which tables and columns are relevant, then construct the query. This dramatically improves correctness and reduces hallucinations.
Moreover, few-shot CoT prompting shows that even large language models (LLMs) benefit from seeing examples of step-by-step reasoning. This mirrors human learning and reinforces the idea that reasoning is not just a technique; it is a cognitive scaffolding that improves performance.
In practice, GenAI systems often generate multiple candidate responses. Without reasoning, choosing the best one becomes arbitrary or purely embedding-based. With reasoning plus reranking, systems can evaluate each candidate against the evidence and select the one that is best supported.
This meta-reasoning, reasoning about generated responses, is critical for reducing hallucinations and improving reliability. It is especially important in high-stakes decision-making systems, such as AI assistants in healthcare or finance.
Reasoning helps models generalize to novel tasks. A model trained to reason can often transfer its problem-solving strategies to tasks it has never explicitly seen.
Without reasoning, the model is limited to surface memorization. With reasoning, it begins to approximate problem-solving intelligence—a hallmark of true general-purpose AI.
Reasoning not only helps the machine; it also helps the human collaborating with it. When GenAI models explain their steps, users can follow, verify, and correct them.
This is particularly useful in co-pilot scenarios where AI assists a domain expert. For example, in data science, a GenAI agent that can reason through exploratory data analysis (EDA) steps helps analysts speed up discovery while staying in control.
As we move toward agentic systems (AI agents that plan, act, and reflect autonomously), reasoning becomes the foundation. These agents must plan multi-step actions, observe their outcomes, and revise their approach when something fails.
Every one of these actions depends on a reasoning layer. Without it, agents are random trial-and-error engines, but with reasoning, they become adaptable, intelligent assistants.
In an era where GenAI systems are increasingly integrated into workflows, decision-making, and user interactions, reasoning is not optional; it is essential. It transforms passive generators into active problem-solvers. It brings clarity, accuracy, adaptability, and trustworthiness into AI interactions. Whether through CoT prompting, ReAct loops, or multimodal reasoning chains, this capability enables AI to handle ambiguity, plan actions, explain decisions, and collaborate meaningfully with humans.
As we build more advanced GenAI systems, reasoning is the bridge between generation and intelligence. And crossing that bridge is what unlocks the next frontier of AI.
GenAI systems, especially LLMs and AI agents, are increasingly being designed to think through problems rather than just produce surface-level text. Modern LLMs like GPT-4 and PaLM can mimic various reasoning patterns (from strict logic to commonsense) to draw conclusions or make decisions. However, while they excel at pattern recognition and fluent imitation, true reasoning (logically connecting information, inferring unseen facts, solving novel problems) is still a challenge. Researchers are actively enhancing LLM reasoning via techniques like CoT prompting, ReAct agents, and multimodal fusion architectures to make AI's thinking more human-like and robust. The following sections give an overview of the key types of reasoning in GenAI: what each type is, examples of it, and how current systems implement it to improve performance, decision-making, disambiguation, and robustness.
Deductive reasoning is the process of drawing specific, logically certain conclusions from general premises or rules. If the given premises are true, a deductive conclusion must also be true. For example, from all whales are mammals and Orca is a whale, a deductive system concludes Orca is a mammal. LLMs can emulate deductive logic by following if-then rules and performing step-by-step inference. In practice, CoT prompting often instils a deductive style; the model is prompted to break a problem into logical steps and derive the answer systematically. This has been effective for tasks like formal logic puzzles or arithmetic, where the solution follows inevitably from the premises. By explicitly generating intermediate steps, an LLM’s answer is more likely to be logically valid and traceable to the input facts, which improves reliability in domains like math proofs or code reasoning. Deductive reasoning contributes to robust decision-making by ensuring conclusions are consistent with given facts, reducing mistakes in tasks that demand rigorous correctness.
Implemented in GenAI: CoT prompts are a direct way to elicit deductive thinking. For instance, given a math word problem or a logical riddle, models like GPT-4 are encouraged to list premises and infer each step before finalizing the answer, much like a proof. This method significantly boosts accuracy on multi-step logic and math tasks. Some neuro-symbolic systems even combine LLMs with automated theorem provers to double-check deductive steps, blending statistical and formal reasoning for extra rigor.
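As a concrete illustration, a minimal deductive CoT prompt might look like the following sketch (the wording and model choice are illustrative):

from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="mistral")  # any locally served chat model
deductive_prompt = """Premises:
1. All whales are mammals.
2. Orca is a whale.

Question: Is Orca a mammal?
Reason step-by-step from the premises, then state the conclusion."""
print(llm.invoke(deductive_prompt).content)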
Inductive reasoning involves generalizing from specific instances or evidence to broader rules or conclusions. The outcome is probable rather than guaranteed; it is essentially pattern learning from examples. In human terms, if you observe the past 10 code builds that succeeded after adding a certain patch, you might inductively conclude that this patch generally fixes the build. LLMs are inherently strong inductive reasoners because of the way they are trained: They ingest millions of examples and learn to predict patterns. Few-shot learning in prompts is a prime example; an LLM is given a handful of input/output (I/O) examples (specific cases), and it infers the general pattern to apply to a new query. In-context learning in LLMs is often described as inductive reasoning, as the model abstracts a rule from the prompt examples and extends it to solve a novel instance. This contributes to generalization and adaptability. For example, if shown a couple of formatted date conversions, the model can induce the formatting rule and convert a new date without explicit programming. Inductive reasoning improves creative generation and pattern recognition, but it can also introduce uncertainty—the conclusions are plausible but not certain, so models must sometimes verify inductive guesses with additional checks.
Implemented in GenAI: LLMs implement induction largely via learning from data and few-shot prompting. Rather than a special prompting technique, induction is a natural by-product of training on vast text and adjusting to given examples. For instance, GPT-style models can infer a list sorting rule or grammatical pattern from a few demonstrations and then continue it, showcasing inductive generalization. Self-consistency techniques can augment induction by having the model generate multiple plausible answers and then choose the most common or consistent one, effectively considering several inductive hypotheses and selecting the best.
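A minimal self-consistency sketch, assuming a locally served chat model; the sample count and temperature are illustrative, and a production system would extract the final answer from each reasoning chain before voting:

from collections import Counter
from langchain_community.chat_models import ChatOllama

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    llm = ChatOllama(model="mistral", temperature=0.8)  # sampling yields diverse reasoning chains
    answers = [llm.invoke(prompt).content.strip() for _ in range(n)]
    # majority vote across the sampled answers
    return Counter(answers).most_common(1)[0][0]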
Abductive reasoning is reasoning to the best explanation, forming a plausible hypothesis from incomplete observations. It is the kind of reasoning a detective uses: if we see footprints by the window and the safe open, the best explanation is a burglary. Unlike deduction, abductive conclusions are not guaranteed to be true; they are educated guesses. LLMs can perform abductive reasoning in tasks where they must fill in gaps or infer hidden causes. For example, given a partial story, an AI might guess a character's motive that best explains their actions. Abductive reasoning is valuable for commonsense inference and troubleshooting, where multiple explanations exist and the system must pick the most likely. In GenAI, one way to implement this is via a propose-and-verify CoT: the model first posits a hypothesis, then internally checks if it fits the evidence. Studies show that LLMs can benefit from this approach: for instance, treating a multiple-choice question as an abductive task, hypothesizing an answer, and then seeing if it makes sense in context often yields better results. Humans naturally switch to abductive reasoning when direct deduction is hard, and LLM agents are starting to mimic that flexibility. By incorporating abductive reasoning, AI systems become more robust to ambiguity, as they can handle incomplete information and still offer a reasonable solution.
Implemented in GenAI: Researchers have explored prompts that explicitly tell the LLM to think of possible explanations. For example, given a riddle or a diagnostic question, the model might be guided to enumerate potential reasons and then conclude with the most plausible one. Some agent frameworks implement abductive strategies by generating a hypothesis and using a tool (like a knowledge lookup) to verify it before finalizing the answer, a process akin to hypothesize, then test. This approach is useful in diagnosis tasks (medical or technical) where the AI suggests a cause for symptoms and then checks consistency with known facts, improving decision-making under uncertainty.
Analogical reasoning involves drawing parallels between similar situations or structures to infer a conclusion. In essence, the AI uses an analogy: if two things share some relationships, then knowledge about one can inform understanding of the other. A classic example is solving analogies like bird is to sky as fish is to ___? The model must recognize the lives-in relationship and conclude that fish live in water. LLMs can handle simple analogies because they have seen many word relationships (synonyms, categories, etc.) during training. For instance, GPT-4 can complete knife is to cut as pen is to ___ with write by recognizing the functional analogy. Beyond word puzzles, analogical reasoning lets AI apply known solutions to new problems by recognizing structural similarity. An LLM agent might approach a new task by recalling a scenario it knows that is analogous, then mapping the solution over. This contributes to creative problem-solving and disambiguation. If an instruction is unclear, the AI might recall an analogous example from its prompt or memory to interpret it correctly. However, analogical reasoning can be challenging when the analogy is abstract or requires real-world experience. Current GenAI systems implement analogies mostly implicitly (through learned language patterns), but research is emerging to make this more explicit. One approach directs the model to identify the relationship in one pair and then apply it to another, thereby forcing an analogical CoT.
Implemented in GenAI: Analogical reasoning is not as commonly highlighted as other reasoning types, but it is present in tasks like metaphor understanding or Scholastic Aptitude Test (SAT)-style analogy questions. Prompting strategies can encourage analogy by asking, how is this situation similar to a known scenario? Some experimental methods give the LLM examples of analogies to follow. For example, a prompt might show: Paris is to France as Tokyo is to Japan (country-capital relationship), and then ask the model to apply that relation to a new pair. By doing so, the model explicitly searches for the analogous relationship. Encouraging analogies helps in knowledge transfer. For instance, a multimodal agent could reason that holding a pencil is analogous to holding a paintbrush to transfer motor skills, or an LLM could solve a puzzle by recalling a similar puzzle's solution format. Recent work even trains meta-models to pick the best reasoning style (deductive vs. abductive vs. analogical) for a given problem, illustrating that adding analogical thinking can expand the range of solvable tasks.
Commonsense reasoning is the ability of an AI to use everyday world knowledge and obvious logic that humans take for granted. This includes basic facts (water is wet), spatial-temporal common sense (people do not walk through walls), social norms, and cause-and-effect in typical situations. It is crucial for understanding implicit meanings and avoiding nonsensical answers. LLMs learn a great deal of common sense from their training text, but they may not always apply it reliably. For example, a naive model might answer the question can an elephant fit through a doorway? with yes, if it squeezes, demonstrating a lack of commonsense physical reasoning. With proper techniques, generative models can reason that an elephant is too large for a standard door, so the answer should be no. One successful approach is to use CoT to inject common sense: by walking through a scenario step-by-step, the model can be reminded of common knowledge at each step. Indeed, CoT prompting has been found to improve performance on commonsense QA tasks by letting the model articulate cause-and-effect and world knowledge before answering. For example: it is raining and John left his umbrella at home. What will happen when he walks outside? A CoT might explicitly note that it is raining, and without an umbrella, John will get wet, leading to the answer that John will get soaked. Commonsense reasoning greatly aids disambiguation; it helps an AI choose interpretations that make sense in context (e.g., understanding idioms, resolving pronouns by plausible intent). Modern LLMs also leverage external knowledge bases or tools for commonsense facts: if unsure, an agent can query a fact database (like asking if elephants fit through doors) to avoid silly mistakes. By building in common sense, AI systems become more robust and aligned with human expectations, improving their decision-making in open-ended real-world scenarios.
Implemented in GenAI: Commonsense reasoning is often enhanced by prompt engineering and fine-tuning on specialized data. Datasets like CommonsenseQA or StrategyQA train models on everyday reasoning questions, improving their internal grasp of physical and social logic. In prompts, developers might include statements of obvious facts (Reminder: elephants are bigger than doors) to cue the model. CoT is helpful as models like GPT-4 can be prompted to explain a scenario (the cup fell off the table, so it likely broke because cups are fragile) before answering, ensuring they consider general knowledge. Another approach is retrieval-augmentation: if a question needs commonsense knowledge (e.g., do elephants fit through doors), an LLM agent can use a search tool to check typical elephant sizes or known facts. This tool-augmented reasoning mimics how humans recall facts or consult references, leading to answers that are both correct and make sense. By combining innate model knowledge with external information and explicit reasoning steps, current AI systems handle commonsense queries much better than earlier generations.
Causal reasoning is the ability to understand cause-and-effect relationships by identifying what leads to an event or predicting its outcomes. For example, an AI capable of causal reasoning can infer that a glass falling on a hard floor shatters or, conversely, that rain leaves the street wet. This type of reasoning is vital for planning and prediction tasks. In GenAI, causal reasoning comes into play when models need to reason about why something happened or what-if scenarios. LLMs can sometimes infer causal links by relying on patterns (rain leads to wet streets is common in text), but true causal inference is hard because correlation in data is not always causation. To improve this, CoT prompting can be used to have the model explicitly consider causal chains: e.g., X happened, which would cause Y, which in turn causes Z. By enumerating these links, the model can avoid logical leaps. One interesting benefit for decision-making is that an agent with causal reasoning will foresee the outcome of its actions (useful in planning tasks or game environments). For instance, a robot-planning LLM might reason: if I knock over the vase, it will break and upset the user, so I should avoid that. This forward simulation is causal reasoning at work. It also aids disambiguation; consider a question like, the lawn is wet in the morning. What might be the cause? A causal-reasoning LLM can propose maybe it rained overnight, or the sprinkler ran, aiming to apply real-world knowledge of typical causes. Some specialized benchmarks (e.g., CLadder, CausalQA) test LLMs on cause-effect understanding, and results show that larger models with reasoning prompts can identify causal relations more often than chance. Still, purely text-based models can be fooled by surface cues, so researchers integrate causal diagrams or structured knowledge to solidify this ability. Causal reasoning ultimately contributes to an AI's robustness by ensuring its actions and answers follow logically from causes, and it can handle what-if questions more reliably.
Implemented in GenAI: Current systems enhance causal reasoning through a mix of prompting and architecture. On the prompting side, techniques like counterfactual prompts ask the model to imagine different causes and check consistency (If X had not happened, would Y still happen?). This encourages the LLM to distinguish mere correlation from actual dependency. CoT can explicitly prompt: let us analyze the causal chain step-by-step. On the architecture side, some approaches convert text into structured forms like causal graphs and then reason over them. For example, an LLM can be guided to read a paragraph and extract events and their temporal order or causal links, forming a mini knowledge graph. It might then reason over this graph (either with an internal module or by generating a logical explanation) to answer a question or make a decision. Such a method was used to improve temporal and causal reasoning by translating text into a timeline graph and then performing reasoning with the help of CoT on that graph. Additionally, tool-augmented agents can do causal reasoning by querying cause-effect databases or running simulations. For instance, an AI might use a physics engine tool to predict outcomes of physical actions, thereby grounding its causal predictions in reality. All these implementations aim to ensure the AI not only knows that something happens, but why, thereby making its behavior more reliable and interpretable.
Spatial reasoning is the capacity to reason about space, geometry, and physical layouts: understanding relationships like left-right, above-below, distances, or how objects fit together. In humans, this underpins tasks from packing a suitcase to navigating a route. For AI, spatial reasoning can mean reading a textual description of a scene and determining spatial relations or looking at an image and understanding object arrangements. LLMs on their own (with only text input) often struggle with complex spatial problems described in language. For example, a text-based puzzle might say the red ball is two spots to the left of the blue ball, which is not at the leftmost position, and ask which spot the red ball is in. Without a diagram, the model must simulate a mental map. Generative models have had difficulty with such tasks because keeping track of multiple relative positions is challenging in pure language form. However, researchers developed prompting strategies to help. One effective method is Chain-of-Symbol (CoS) prompting, which has the model convert spatial descriptions into a simplified symbolic representation (like a grid or list of coordinates) before reasoning. By using symbols (e.g., abbreviations for objects and positions), the model can internally draw a mental map and then answer questions about it. This approach greatly improved accuracy on spatial tasks such as planning and navigation instructions. For instance, in one example, the model was asked about a list of items and had to figure out how many were vegetables (requiring it to identify items and count them). By representing the items as a dictionary of categories, the LLM could count vegetables and arrive at the correct answer (7) in a step-by-step manner. Spatial reasoning is crucial for multimodal agents (like robots or vision-language models (VLMs)) because they must interpret real-world layouts. It contributes to robust performance by preventing absurd outputs (an AI with spatial sense will not say the cat is inside the closed box if not possible) and allowing better planning (knowing an object’s location relative to another).
Implemented in GenAI: Spatial reasoning is implemented both through specialized prompting and multimodal model design. On the text side, as noted, CoT can incorporate symbols: e.g., the prompt can instruct, let us use coordinates, mark positions of each object, and then answer. This was shown to save tokens and boost accuracy on complex spatial puzzles. For navigation or pathfinding, LLM-based agents can output step-by-step directions by internally simulating movements on a map described in text. In the multimodal realm, VLMs (like GPT-4V or PaLM-E) inherently perform spatial reasoning by processing images. These models use fusion architectures that combine visual and textual features, allowing them to, say, look at an image of a room and answer spatial questions (is the chair to the left of the table?). Some advanced systems even allow the LLM to manipulate images as part of reasoning, for example, Visual ChatGPT or OpenAI’s Visual CoT can rotate or zoom into an image to better inspect details. This is akin to a human tilting their head to understand a scene. Such tool-assisted visual reasoning enables the AI to handle spatial tasks with greater accuracy. Overall, by integrating spatial representations (either via symbols or via visual inputs), GenAI becomes much more capable at tasks that mirror our physical world understanding.
Temporal reasoning is reasoning about time, the order of events, durations, frequencies, and temporal relationships (before/after, while, until, etc.). For AI, temporal reasoning is needed to interpret stories, schedule tasks, or understand processes. For example, an AI should infer from a narrative that Alice finished breakfast before going to work, which implies breakfast happened earlier than work. While this sounds simple, LLMs can get confused with complex time-based logic, especially when events are described out of chronological order or involve implicit time jumps. Temporal reasoning also includes understanding durations (e.g., if told John took a 2-hour nap starting at 1 PM, the AI should conclude he woke at 3 PM). In GenAI systems, robust temporal reasoning ensures consistency in stories (no character magically knowing something that has not happened yet), correct answers in questions about sequences, and proper planning for agents. Research indicates that LLMs still struggle with temporal logic and often require augmentations to handle it. For instance, one study noted that temporal reasoning tasks require a combination of skills, logical ordering, basic arithmetic (for dates or durations), and commonsense knowledge of typical timelines. To improve LLM performance, a technique has been to use an intermediate temporal representation, such as a timeline or temporal graph (TG). In a recent approach, text describing events is converted into a TG, a structured timeline, and then the LLM reasons over that graph using CoT steps. By explicitly mapping events to a timeline, the model more easily answers questions like what happened just before X? or did Y happen after Z? This method yielded more reliable reasoning steps and answers than letting the LLM free-form its internal timeline. In interactive agents, temporal reasoning allows planning over time (e.g., figuring out an order of execution: first heat the oven, then mix ingredients, because the oven needs preheating). It also helps with disambiguation, if two similar events are mentioned, understanding which came first can clarify context (as in stories or historical questions). Overall, temporal reasoning adds to an AI’s robustness and coherence, ensuring that the dimension of time is handled in a human-like way.
Implemented in GenAI: Improvements in temporal reasoning come from explicitly teaching models about time. One strategy is temporal CoT, where the prompt guides the LLM to list events in order or compute time differences step-by-step. For example, a prompt might say: Let us sort these events by when they happened, before answering a question about them. Another strategy is integrating tools: an LLM agent might call a calendar API or a date calculator to handle tricky date arithmetic (like what day of the week will it be 45 days from Tuesday?) to avoid mistakes. As mentioned, converting text to a temporal graph is like giving the model an internal timeline to consult. After building such a graph (nodes as events, edges as temporal relations), the AI can either query it with a logical module or traverse it with learned reasoning steps. Also, specialized training data can help, e.g., fine-tuning a model on stories with annotated event timelines or on math word problems about time (so it learns concepts like elapsed time). In summary, GenAI systems are increasingly addressing temporal reasoning by combining language models with structured time representations and by prompting them to think chronologically, which leads to a more accurate understanding of when things happen and in what sequence.
Mathematical reasoning refers to the ability to solve mathematical problems and perform correct calculations or symbol manipulations. This ranges from basic arithmetic (what is 12 × 9?) to complex word problems or even proving theorems. Historically, pure neural language models were notorious for making arithmetic errors or failing at multi-step math problems because they tended to guess answers based on pattern recognition. However, with techniques like CoT, LLMs have shown remarkable improvements in math problem-solving. The key is that math requires deductive, stepwise reasoning, exactly what CoT prompting encourages. For example, consider a word problem: Roger has five tennis balls. He buys two cans of three tennis balls each. How many balls does he have now? If asked naively, a model might output a wrong guess, but with a CoT prompt it will reason: He had 5. Two cans of 3 each means 6 more. 5 + 6 = 11, and then conclude 11. By externalizing each step (instead of trying to do it all in the hidden layers), the model dramatically reduces errors. Mathematical reasoning is not just arithmetic; it includes algebraic reasoning (solving for X), geometric reasoning (about shapes), and even logical puzzles like Sudoku. LLMs still have limits here, especially without tools; they might falter on very large numbers or long proofs because of length and precision limits. To bolster performance and accuracy, AI systems often incorporate tool-based approaches for math. An LLM agent can call a calculator or a Python interpreter for exact computation, ensuring no simple arithmetic mistakes. This kind of tool use has been shown to essentially eliminate calculation errors while the LLM focuses on setting up the problem correctly. The synergy of the LLM’s reasoning and the tool’s precision yields both correct and explainable solutions. The model explains the reasoning in words, and the tool provides the numeric answer. Mathematical reasoning in AI leads to improved performance on benchmarks (like GSM8K, a math word problem set) and is a good indicator of an AI’s ability to handle systematic logical tasks.
Implemented in GenAI: CoT prompting is the main paradigm shift that unlocked much better mathematical reasoning in LLMs. Developers include worked examples with step-by-step solutions in the prompt or instruct the model to think step-by-step for math questions. This has enabled even models like GPT-3.5 to solve many grade-school math problems correctly, where they previously failed. For higher-level math or longer calculations, integrating external tools is common. For instance, OpenAI’s code interpreter allows ChatGPT to write and run Python code; a user can ask a complex math question, and the model will generate a small script to compute the answer, combining logical setup from its reasoning with flawless computation by the machine. In agent frameworks like ReAct, a math question might trigger the LLM to issue an action, Calculator[expression], get the result, and then continue the reasoning with that number. There are also specialized neuro-symbolic models (like AlphaCode for coding or MetaMath for theorem solving) that blend neural networks with formal math solvers. These systems treat math problems by generating hypotheses (potential solutions) and formally verifying them, much like a human might test an equation solution. In summary, mathematical reasoning is implemented through careful prompt design that encourages logical breakdown, sometimes combined with symbolic modules or tools that execute the grunt work of math, allowing the AI to achieve both correctness and clear justification in its answers.
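As a concrete illustration of the Calculator[expression] pattern, here is a minimal, hypothetical sketch of a safe calculator tool an agent framework might expose; the name and the AST-based whitelist are our assumptions, not a specific framework's API:

import ast
import operator

# Map supported AST operator nodes to arithmetic functions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expression: str) -> float:
    # Safely evaluate an arithmetic expression such as "5 + 2 * 3".
    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

print(calculator("5 + 2 * 3"))  # -> 11, the tennis-ball total from above

The LLM sets up the expression as part of its reasoning; the tool guarantees the arithmetic is exact.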
Tool-based reasoning in AI refers to an agent’s ability to use external tools or APIs (such as search engines, calculators, databases, or even other AI models) as part of its reasoning process. Instead of relying solely on its internal knowledge, the AI recognizes when a tool can help and then acts to fetch information or perform an operation, and then reasons with the result. The ReAct framework is a leading paradigm that formalizes this process. In ReAct agents, the LLM does not just output an answer; it interleaves thoughts (CoT reasoning) with actions like tool calls. For example, consider a complex question: what is the capital of the country that won the FIFA World Cup in 2018? A tool-using agent will think that the question asks for the capital of the country that won in 2018. That country was France (won the 2018 World Cup). The capital of France is Paris. However, to be sure, it might perform actions: search for the 2018 World Cup winner (gets France), then search for the capital of France (gets Paris), and then answer Paris. During this process, the agent’s reasoning trace might look like:
Thought: I need to find the World Cup 2018 winner.
Action: search(2018 World Cup winner)
Observation: France won in 2018.
Thought: Now find the capital of France.
Action: search(capital of France)
Observation: The capital is Paris.
Thought: So the answer is Paris.
This framework improves accuracy and robustness because the model can fetch up-to-date or precise information rather than guessing (reducing hallucinations). It also helps with disambiguation, like if the question is unclear, the agent can do a quick lookup or ask a clarifying question as a tool. According to the ReAct paper, such agents showed superior performance on knowledge-intensive tasks and reduced errors that come from the model’s uncertainty. Essentially, tool-based reasoning lets AI systems overcome their training limitations. If an LLM does not know something (e.g., a very recent event or a tricky calculation), a tool call can supply that knowledge, and then the LLM’s reasoning can integrate it into the answer. This synergy mimics how humans think; we use notepads for calculation, search engines for facts, etc., resulting in more reliable and trustworthy AI outputs.
Implemented in GenAI: Modern LLM-based agents (e.g., those built with frameworks like LangChain, or OpenAI’s Function calling API) operationalize tool use with structured prompts. A typical ReAct prompt might include examples like:
Thought: I need to know X
I will use Tool Y. Action: Y(query).
Observation.
Thought: Based on that, next I will... and so on.
The agent continues this loop until it can formulate a final answer (finish[answer]). Tools can be anything: a web search (for knowledge), a calculator (for math), a translation API, a database lookup, or even image recognition in a multimodal agent. The prompt engineering ensures the LLM knows the tools available and how to format actions. Because the LLM’s CoT is explicitly connected to actions, the system can handle very complex tasks by decomposing them: the reasoning decides what needs to be done and in what order, and the acting fetches results or effects changes. This greatly improves decision-making in unfamiliar situations. For instance, an AI home assistant faced with my internet is down, what should I do? It might not have that answer in its training, but with tool use, it can run through steps: ping a server, read a troubleshooting guide, etc., then give a solution. Multimodal agents also use tool-based reasoning: an example is an agent that can see an image and then use an OCR module as a tool to read text in the image, then reason about it. Tool-based reasoning frameworks like ReAct have been pivotal in moving AI beyond static QA to interactive problem-solving. They contribute to robustness (fewer incorrect answers since the model can verify facts) and enable continuous learning, in a sense, because the model can always fetch updated info, and it is less constrained by the fixed training data. In sum, tool-based reasoning equips GenAI with a form of augmented intelligence, combining the model’s textual reasoning with the precise capabilities of external tools to achieve far better performance on complex, real-world tasks.
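A framework-agnostic sketch of this loop is shown below; the llm callable, the tools dictionary, and the exact Thought/Action/Observation format are illustrative assumptions rather than a particular library's API:

import re

def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    # Drive a Thought -> Action -> Observation loop until finish[answer].
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # model emits a Thought and an Action
        transcript += step + "\n"
        done = re.search(r"finish\[(.*?)\]", step)
        if done:
            return done.group(1)            # final answer extracted
        action = re.search(r"Action: (\w+)\((.*?)\)", step)
        if action:
            name, arg = action.groups()
            observation = tools[name](arg)  # e.g., tools["search"]
            transcript += f"Observation: {observation}\n"
    return "No answer found within the step budget."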
While many of the above reasoning types are discussed in the context of text, multimodal agents extend reasoning across various data types, e.g., images, audio, video, and text together. In such systems, reasoning involves fusing information from multiple modalities and potentially using one modality to disambiguate or confirm information from another. For instance, consider an AI that sees an image of a messy room and is asked, can the vacuum reach the crumbs under the couch? It must combine visual spatial reasoning (from the image) with physical commonsense to answer. Multimodal reasoning is enabled by architectures that perform alignment and fusion of modalities. Alignment means linking corresponding elements (e.g., matching a caption sentence to a region in an image), and fusion means jointly processing the inputs to produce a unified understanding. Models like GPT-4 Vision and Google’s PaLM-E use transformers that accept both text and visual embeddings, so the model effectively sees and reads in one combined representation. This allows it to do things like identify an object in an image and then reason about it with world knowledge. Notably, OpenAI’s recent research demonstrated models that think with images in their CoT, meaning the model can perform internal visual processing as part of step-by-step reasoning. For example, the model might internally decide to zoom into a part of an image or rotate it to read text, all as intermediate steps in solving a problem. This is essentially a multimodal ReAct: the model treats image manipulations as tools within a CoT reasoning process. The result is a significant improvement in tasks like visual QA (VQA), image-based troubleshooting, or spatial reasoning from pictures. By fusing visual and textual reasoning, these systems achieve state-of-the-art performance on benchmarks that require understanding both modalities. For instance, an AI can read a diagram (image) and a related text paragraph together, reason about them, and answer a complex science question that needs both modalities, something neither text-only nor image-only models could easily do alone. Multimodal fusion contributes to disambiguation (the image can clarify what the text refers to and vice versa) and to robustness (the AI is less likely to hallucinate about visual details because it sees them). It also opens advanced applications: a multimodal agent can plan actions in a physical environment (vision gives it the current state, language reasoning gives it planning ability) or provide richer explanations (pointing to parts of an image while verbally reasoning).
Implemented in GenAI: At the architecture level, multimodal transformers incorporate modalities through techniques like cross-attention, where, say, text tokens attend to image feature maps. There are generally two patterns: a two-tower model encodes each modality separately, then combines at a later stage (e.g., via concatenation or a small fusion network), and a one-tower model that, from the start, processes mixed modality input in one network. The one-tower (fully fused) approach is what models like GPT-4 Vision use, essentially treating image patches like tokens alongside text tokens. This tight integration enables nuanced reasoning, like referencing a specific object in the image when generating text. On the software side, frameworks like HuggingGPT and others orchestrate multiple expert models (one for vision, one for language) in a reasoning loop; the language model decides when to call the vision model (as a tool), and then uses the result. This is a modular way to get multimodal reasoning: the LLM’s CoT includes steps like self-questioning; I have an image, let me ask the vision module for a description, then using that description, I will answer the question. Such systems have successfully handled tasks like describing an image and then answering follow-up questions about it. Visual CoT prompting is another emerging technique: the model is prompted with not just textual thinking but also to imagine or sketch out a solution. For example, to solve a puzzle about tying knots, the prompt might encourage the model to visualize the steps (some research gets models to produce a pseudo-drawing in ASCII as part of reasoning!). While in the early stages, these approaches point towards AI that can use imagination-like processes. Finally, multimodal reasoning improves comprehensiveness by leveraging complementary strengths of each modality; the AI gets a fuller picture. Empirically, combining modalities often boosts accuracy and generalization. Multimodal GenAI agents can therefore tackle complex tasks (like explaining a meme, which needs vision + language + cultural commonsense) that were previously out of reach, all by integrating the reasoning types discussed (spatial, causal, commonsense, etc.) within a unified multimodal framework.
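The two patterns can be sketched in a few lines of PyTorch; the encoder modules below are placeholders for whatever vision and text backbones are used, so treat this as a structural sketch rather than any specific model's implementation:

import torch
import torch.nn as nn

class TwoTowerFusion(nn.Module):
    # Encode each modality separately, then fuse late with a small network.
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, dim: int):
        super().__init__()
        self.image_encoder = image_encoder   # e.g., a vision transformer
        self.text_encoder = text_encoder     # e.g., a text transformer
        self.fusion = nn.Linear(2 * dim, dim)

    def forward(self, image, text_tokens):
        img_vec = self.image_encoder(image)       # (batch, dim)
        txt_vec = self.text_encoder(text_tokens)  # (batch, dim)
        return self.fusion(torch.cat([img_vec, txt_vec], dim=-1))

# A one-tower model would instead interleave image patches and text tokens
# into a single sequence and run one transformer over the combined input.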
So, today’s GenAI systems intertwine these diverse reasoning types to achieve more human-like intelligence. Each type of reasoning contributes in its own way to making AI outputs more accurate, coherent, and reliable. For instance, deductive logic ensures consistency and correctness given rules, induction and abduction allow creativity and handling of uncertainty, analogies enable knowledge transfer, and strong commonsense, causal, spatial, and temporal reasoning prevent the bizarre mistakes earlier models made about the world. Mathematical reasoning and tool use greatly enhance precision and factual accuracy, addressing key weaknesses of the past. Implementations like CoT prompting have proven that prompting an LLM to think aloud can significantly improve performance across math, logic, and commonsense tasks. Agent frameworks like ReAct go a step further by letting the model act on its thoughts (e.g., browsing or calculating), which makes decision-making more grounded and less prone to hallucination. And as we embrace multimodal fusion, AI can draw on the full richness of visual and textual information, leading to robust understanding and reasoning in complex, real-world scenarios. Crucially, research has shown that no single reasoning strategy is best for all problems—each approach can uniquely solve certain challenges. Therefore, the cutting edge of AI is about combining these reasoning types. By equipping generative models with a toolbox of reasoning skills and the strategies to choose among them, we are moving closer to AI systems that can think through problems as flexibly and reliably as humans do, if not more so.
Reasoning benchmarks are specialized evaluation tools designed to measure how well LLMs can think through problems, make logical inferences, and arrive at correct conclusions beyond simple pattern matching. Unlike traditional natural language processing (NLP) benchmarks that focus on language fluency or factual recall, reasoning benchmarks test multi-step problem-solving, mathematical deduction, causal inference, and planning skills essential for tackling complex, real-world tasks. By providing standardized, challenging scenarios across diverse domains such as science, law, and commonsense reasoning, these benchmarks help researchers objectively assess model performance, identify weaknesses, compare systems, and track progress over time. They are critical for ensuring that LLMs are not only articulate but also genuinely capable of robust, reliable reasoning.
The following table summarizes widely recognized benchmarks used to evaluate the reasoning capabilities of LLMs, highlighting their primary purpose and areas of focus:
| Benchmark | Purpose | Focus |
| --- | --- | --- |
| Massive Multitask Language Understanding (MMLU), AI2 Reasoning Challenge (ARC), HellaSwag, Grade School Math 8K (GSM8K) | General reasoning, commonsense, and math | Broad reasoning skills |
| Google-proof Question and Answers (GPQA), MATH, LogiQA | Advanced reasoning in STEM and logic | Deep domain-specific reasoning |
| R-Bench, OneEval, Humanity’s Last Exam (HLE) | Multidisciplinary or structured reasoning | Challenging cross-domain evaluation |
| Advanced Reasoning Benchmark (ARB), PlanBench | Complex, specialized reasoning scenarios | Next-level reasoning depth |
| OptiLLMBench | Impact of inference techniques | Reasoning efficiency |
| Apple’s puzzles | Stress-testing reasoning limits | Model robustness evaluation |
Table 12.1: Key benchmarks for LLM reasoning evaluation
In this chapter, we laid the theoretical groundwork for understanding reasoning in GenAI systems. By exploring a range of reasoning types, from deductive logic to multimodal integration, we highlighted how each contributes to more intelligent, reliable, and context-aware AI behavior. We examined how reasoning enhances capabilities like disambiguation, planning, tool use, and explanation. With techniques such as CoT prompting and ReAct-style agent design, reasoning becomes a practical tool for guiding AI outputs. This foundational understanding equips you to build advanced GenAI systems that not only generate but also reason through complex, real-world tasks.
In the next chapter, we will implement two types of reasoning in GenAI systems.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
Having established a thorough theoretical foundation for reasoning in generative AI (GenAI), we now shift focus from why reasoning matters to how it can be practically implemented. In this chapter, we will understand the architectural design patterns, frameworks, and modular components required to build reasoning-augmented GenAI systems.
You will explore real-world implementations using tools like LangChain, Ollama, and Python, and learn how to combine chain of thought (CoT) prompting, reasoning and acting (ReAct) style agent workflows, and tool-augmented execution into scalable AI pipelines. Through hands-on code walkthroughs and reusable templates, you will learn how to engineer systems that retrieve, reason, act, and adapt across text, images, and structured data.
In this chapter, we will learn about the following topics:
This chapter aims to provide a comprehensive understanding of reasoning mechanisms within GenAI systems. It begins by exploring advanced prompting techniques that facilitate structured reasoning in language models. The chapter then delves into architectural frameworks and implementation strategies for integrating reasoning at the reranking stage, where retrieved candidates are evaluated and refined. Finally, it examines reasoning at the recommendation stage, demonstrating how multi-source data and user profiles can be synthesized to generate context-aware suggestions. Through practical examples and design principles, readers will gain insights into building intelligent, reasoning-capable AI systems for both retrieval refinement and personalized recommendations.
Before diving into code and architecture, let us take a quick look at prompting techniques for reasoning. Prompting is one of the most critical techniques for guiding and extracting reasoning capabilities from large language models (LLMs). As models grow more powerful, prompting strategies must evolve to support not just pattern recognition, but structured, explainable, and multi-step reasoning. This section begins with foundational prompting strategies like zero-shot and few-shot prompting, then progresses towards advanced methods such as CoT, tree of thoughts (ToT), ReAct, and others that explicitly scaffold and enrich reasoning processes in GenAI systems.
This section introduces two foundational prompting strategies, which are zero-shot prompting and few-shot prompting, that are widely used to guide LLMs in generating accurate and context-aware responses. While these techniques offer powerful ways to elicit task-relevant behavior without model fine-tuning, it is worth noting that a broader range of advanced prompting strategies exists, such as CoT prompting, self-consistency, tool-augmented prompting, and contrastive prompting; these are covered later in this chapter.
Zero-shot prompting refers to instructing a model to perform a task without providing any prior examples in the prompt. Instead, the model relies entirely on its pretrained knowledge and the natural language instructions given.
Example:
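An illustrative zero-shot prompt (our example, not drawn from a benchmark) might be:

Prompt: Classify the sentiment of this review as positive or negative: "The battery dies within an hour and the keyboard feels cheap."
Model output: Negative

No demonstrations are provided; the instruction alone defines the task.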
When used in reasoning tasks, zero-shot prompting is often paired with process-oriented cues such as let us think step-by-step, which help elicit implicit reasoning chains. This approach has been shown to improve performance on arithmetic, logic, and commonsense problems by encouraging the model to externalize its thought process.
The benefits are as follows:
The following are the limitations:
Few-shot prompting involves including a small number of input/output (I/O) examples within the prompt. These examples serve as in-context demonstrations that guide the model in understanding the task format, reasoning style, or domain expectations.
Example (reasoning task):
Q: Tom has 3 apples. He buys 2 more. How many does he have?
A: Tom starts with 3 apples. He buys 2 more. So, 3 + 2 = 5. Answer: 5
Q: A bottle holds 1.5 liters. How much in 3 bottles?
A: Each bottle holds 1.5 liters. 1.5 × 3 = 4.5. Answer: 4.5
Q: A car travels at 40 km/h for 2 hours. How far?
A: The car travels at 40 km/h for 2 hours. So, 40 × 2 = 80. Answer: 80
Few-shot prompting is specifically effective for eliciting CoT reasoning, where the model learns to articulate intermediate steps before arriving at a final answer.
The benefits are as follows:
The following are the limitations:
While zero-shot and few-shot prompting provide the foundation, advanced reasoning tasks, such as planning, tool-use, and multimodal integration, often require explicit scaffolding of reasoning, memory, or search. The following strategies represent emerging best practices for enabling deeper, more interpretable, and more robust reasoning capabilities in LLMs and GenAI agents.
This section provides an overview of advanced prompting paradigms that specifically aim to scaffold reasoning in LLMs and multimodal agents:
This approach is often used in conjunction with ToT or CoT and improves reliability in tasks where multiple outputs are plausible (e.g., open-ended questions, creative writing, fact-based QA). Reranking can be implemented via LLM self-evaluation, external critic models, or retrieval-guided validation.
These advanced prompting techniques represent a significant step toward aligning GenAI systems with human-like reasoning capabilities. By incorporating structured, diverse, and tool-enhanced reasoning flows, they enable LLMs to handle complex tasks with greater reliability, transparency, and contextual awareness. As GenAI systems are increasingly integrated into real-world workflows, the use of such reasoning-centric prompting strategies will be central to their robustness and trustworthiness.
Now that we have established a comprehensive conceptual understanding, let us proceed to implement two distinct scenarios:
The following figure illustrates a hybrid RAG architecture that enhances result relevance through combined reranking using both cross-encoders and LLM-based CoT reasoning. Starting from user input, the system retrieves semantically similar documents using a vector database, then refines the candidate ranking by fusing shallow semantic similarity (via cross-encoders) with deep reasoning-based scoring (via CoT prompts). The top-k reranked results are finally passed to an LLM for response generation, enabling contextually rich, highly relevant outputs tailored to complex user queries.
Figure 13.1: Hybrid RAG architecture with reranking for improved relevance
The reranking here combines image similarity (1 − distance) and textual relevance scored using a CoT prompt via a language model (LLM). For each candidate spec retrieved using vector similarity, the LLM is prompted to reason step-by-step about how well the specs meet the user query, then assigns a numeric score (0–1). This CoT-generated score is blended with the image similarity score using a weighted average (α = 0.5). The candidate with the highest combined score is selected. Thus, CoT enables deeper semantic understanding beyond vector similarity, improving reranking with interpretability and better alignment to user intent.
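In code, the blending step reduces to a weighted average; a minimal sketch with illustrative variable names:

ALPHA = 0.5  # equal weight for image similarity and CoT relevance

def combined_score(distance: float, cot_score: float, alpha: float = ALPHA) -> float:
    image_score = 1.0 - distance  # convert vector distance to a similarity
    return alpha * image_score + (1.0 - alpha) * cot_score

The candidate spec with the highest combined score wins the reranking.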
Let us understand the code in depth. This section provides a systematic explanation of a modular GenAI pipeline implemented in Python, designed for multimodal query handling, specifically matching user queries with laptop specifications using both image and text modalities. The system is built upon key components such as ChromaDB for vector storage, CLIP for image and text embeddings, and LangGraph for agentic execution with CoT-based reranking.
The following directory structure represents the implementation of a multimodal RAG system that incorporates dual-stage reranking, leveraging both cross-encoder scoring and LLM-based reasoning. Designed to support hybrid retrieval with image and text modalities, the architecture integrates core components for embedding, indexing, reranking, and orchestration via LangGraph agents. This modular layout ensures flexibility for experimenting with different reranking strategies and multimodal retrieval workflows.
Figure 13.2: Folder structure of reasoning at the reranking stage
This module encapsulates I/O utility functions for loading raw data from disk:
import os

def load_text_documents(folder):
    docs = {}
    for file in os.listdir(folder):
        if file.endswith(".txt"):
            with open(os.path.join(folder, file), "r", encoding="utf-8") as f:
                docs[file] = f.read()
    return docs
This function ensures that textual specifications for laptops are appropriately read and prepared for embedding.
def load_image_paths(folder):
    return [os.path.join(folder, f) for f in os.listdir(folder)
            if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
This is crucial for preparing image paths for embedding and indexing.
This module provides access to OpenAI’s CLIP model to embed both text and images into a shared vector space. The model and processor are loaded once globally for computational efficiency:
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

embed_text_ollama(text)

Processes and encodes a given text string into a 512-dimensional embedding using CLIP:

def embed_text_ollama(text):
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = clip_model.get_text_features(**inputs)
    return outputs[0].tolist()
embed_image_ollama(image_path)

Encodes an image (loaded from disk) into a 512-dimensional embedding vector:

def embed_image_ollama(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model.get_image_features(**inputs)
    return outputs[0].tolist()
These embeddings allow for semantic comparisons across modalities.
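Because both functions project into the same 512-dimensional CLIP space, a text query and an image can be compared directly. A small usage sketch (the image path is hypothetical):

import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text_vec = embed_text_ollama("thin gaming laptop with RGB keyboard")
img_vec = embed_image_ollama("data/images/laptop_01.jpg")  # hypothetical path
print(cosine_similarity(text_vec, img_vec))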
This script builds the vector index using ChromaDB. It performs the following steps:
1. Instantiate Chroma client: To begin the indexing process, we first establish a connection to the persistent Chroma client:
client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
2. Create or reset collections: Text and image collections are (re)initialized to avoid stale data:
if CHROMA_TEXT_COLLECTION in [c.name for c in client.list_collections()]:
    client.delete_collection(name=CHROMA_TEXT_COLLECTION)
3. Index text data: Documents loaded via load_text_documents() are embedded and added to the Chroma text collection:
text_collection.add(documents=[content], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": fname}])
4. Index image data: Similarly, image paths are loaded and embedded, with only metadata (filename) used for reference:
image_collection.add(documents=[""], embeddings=[emb], ids=[str(idx)], metadatas=[{"file": os.path.basename(path)}])
This module ensures that all resources are embedded and stored for downstream retrieval.
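Once indexed, retrieval is a similarity query against a collection. For example, assuming the text_collection handle from the indexing step (the query text and n_results are illustrative):

query_vec = embed_text_ollama("lightweight laptop for travel")
results = text_collection.query(query_embeddings=[query_vec], n_results=5)
for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
    print(meta["file"], round(dist, 3))  # candidate file and its distance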
This module uses a cross-encoder model for reranking based on semantic similarity between the query and retrieved metadata filenames.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, metadatas):
    pairs = [(query, doc.get("file", "")) for doc in metadatas]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(metadatas, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
Although not used in langgraph_agent.py, it can complement or replace LLM-based reranking depending on the application.
This module defines a structured agent using LangGraph to execute a multi-step retrieval and reranking process using CoT reasoning. The workflow follows three primary stages: embedding, reranking, and reading.
The langgraph_agent.py module exemplifies an agentic system architecture grounded in the principles of structured decision-making, modular planning, and multimodal reasoning. Leveraging the LangGraph framework, the system implements a stateful pipeline to process user queries, retrieve semantically similar candidates, and rerank them using a hybrid scoring mechanism.
The agent’s design incorporates several advanced capabilities that collectively enable intelligent, multimodal, and context-aware decision-making. The key architectural features include the following:
Combined score = α · image score + (1 − α) · text score
This decision policy exemplifies multimodal reasoning, allowing the agent to autonomously determine the most relevant candidate through weighted evidence aggregation.
The system demonstrates key properties of an agentic architecture:
Although the current design follows a linear execution path without dynamic branching or tool use, the underlying architecture is extensible to support conditional transitions, tool invocation, and more complex agent behaviors. The agent is built upon a LangGraph-based execution flow, where each node performs a specialized function, ranging from embedding inputs to reranking candidates using LLM-based reasoning. The architecture enables interpretable, multimodal decision-making through modular state transitions and structured prompt engineering. The key components include the following:
def llm_score(query: str, specs_text: str) -> tuple[float, str]:
    prompt = (
        "Evaluate how well these laptop specs satisfy the user request.\n"
        "First think step-by-step, then output exactly two lines:\n"
        "Reasoning: <your analysis>\n"
        "Score: <single number between 0 and 1>\n\n"
        f"User request: {query}\n\nLaptop specs:\n{specs_text}"
    )
    ...
The LLM used here is ChatOllama with the "mistral" model.
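The elided body of llm_score presumably sends the prompt to the model and parses the two expected lines. One plausible reconstruction (ours, not the book's exact code; the langchain_ollama import path is an assumption):

import re
from langchain_ollama import ChatOllama  # assumed import path

llm = ChatOllama(model="mistral", temperature=0)

def parse_llm_reply(reply: str) -> tuple[float, str]:
    # Extract the "Reasoning: ..." and "Score: ..." lines from the model's reply.
    reasoning = re.search(r"Reasoning:\s*(.+)", reply)
    score = re.search(r"Score:\s*([01](?:\.\d+)?)", reply)
    return (float(score.group(1)) if score else 0.0,
            reasoning.group(1).strip() if reasoning else "")

# Inside llm_score: reply = llm.invoke(prompt).content; return parse_llm_reply(reply)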
vec = [(a + b) / 2 for a, b in zip(text_vec, img_vec)] if img_vec else text_vec
res = client.get_collection(CHROMA_TEXT_COLLECTION).query(query_embeddings=[vec], ...)
image_score = 1 - distance
text_score = llm_score(query, spec)
combined_score = α · image_score + (1 − α) · text_score

This score is used to select the most appropriate spec:

combined = alpha * img_score + (1 - alpha) * text_score
The CoT reasoning is also logged for interpretability.
builder.add_node("Embed", node_embed)
builder.add_node("Rerank", node_rerank_llm)
builder.add_node("读取", node_read)
builder.set_entry_point("Embed")
builder.add_edge("嵌入", "重新排序")
builder.add_edge("重新排序", "读取")
builder.set_finish_point("读取")
graph = builder.compile()
builder.add_node("Embed", node_embed)
builder.add_node("Rerank", node_rerank_llm)
builder.add_node("Read", node_read)
builder.set_entry_point("Embed")
builder.add_edge("Embed", "Rerank")
builder.add_edge("Rerank", "Read")
builder.set_finish_point("Read")
graph = builder.compile()
def execute_graph_agent(user_query: str, image_vec: list[float] | None = None) -> str:
    res = graph.invoke({"input": user_query, "image_vec": image_vec})
    ...
It returns a formatted output that includes the selected specs, image path, and reasoning log.
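A typical invocation (the query and image path are illustrative) would be:

img_vec = embed_image_ollama("queries/reference_laptop.jpg")  # hypothetical file
answer = execute_graph_agent(
    "I need a laptop like this one but with more RAM for video editing",
    image_vec=img_vec,
)
print(answer)  # selected spec, image path, and the CoT reasoning log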
The codebase presents a well-modularized and extensible system for multimodal information retrieval and recommendation. Notably, langgraph_agent.py integrates LangGraph for orchestrating a pipeline with semantic retrieval and CoT reranking, thus allowing the system to produce explainable and robust matches between user queries and multimodal content. The separation of concerns across loaders, embedding modules, indexing, and reranking ensures reusability and maintainability across various domains beyond laptops, such as e-commerce, education, or healthcare.
The full code can be found in the GitHub repository of this book, under the section reasoning at the reranking stage.
Having explored how reasoning is leveraged during the reranking phase through a CoT reasoning reranker in a multimodal GenAI system, we now shift our focus to a different but equally critical stage in the pipeline, recommendation.
In this next scenario, reasoning plays a pivotal role in synthesizing insights across multiple heterogeneous datasets to generate personalized and context-aware recommendations.
The following figure illustrates the complete flow of a personalized RAG pipeline that integrates structured catalogue data, user preference profiles, and metadata into a unified vector database. All datasets are chunked, embedded using a shared embedding model, and stored in a vector store. At query time, the system performs hybrid retrieval (combining Best Matching 25 (BM25) and dense vector search), followed by a cross-encoder-based reranker for fine-grained scoring.
Figure 13.3: Architecture for reasoning at the recommendation stage
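A compact sketch of the hybrid-retrieval idea, fusing BM25 lexical scores with dense cosine similarity before the cross-encoder pass (the rank_bm25 package and the equal weighting are illustrative choices, not the project's exact code):

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query: str, docs: list[str], doc_vecs, query_vec, weight: float = 0.5):
    # Blend normalized BM25 lexical scores with dense cosine similarity.
    bm25 = BM25Okapi([d.split() for d in docs])
    lexical = bm25.get_scores(query.split())
    lexical = lexical / (lexical.max() or 1.0)  # scale lexical scores to [0, 1]
    q = np.asarray(query_vec)
    dense = [float(q @ np.asarray(v) / (np.linalg.norm(q) * np.linalg.norm(v)))
             for v in doc_vecs]
    return [weight * l + (1 - weight) * d for l, d in zip(lexical, dense)]

The top-scoring candidates from this fused list are then passed to the cross-encoder reranker for fine-grained scoring.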
Three datasets are used in this code; they are shared along with the code in Chapter 12, Advanced Multimodal GenAI Systems, as part of the CoT, reasoning, and reranking recommendation engine with LLM (code part 2).
The following list outlines the three different datasets:
The following directory structure outlines the architecture of the rag_llm_memory_project, a modular RAG system enhanced with long-term memory, personalized profiling, and multimodal reasoning capabilities. Each folder encapsulates a key functional layer of the pipeline, from embedding and retrieval to orchestration, reasoning prompts, and reranking, enabling scalable, context-aware content recommendations and diverse data modalities.
Figure 13.4: Folder structure for reasoning at the recommendation stage
The goal of this recommendation engine is to deliver contextually appropriate content by interpreting natural language prompts through a CoT reasoning process. The following illustrates a representative execution scenario based on the user prompt:
User input prompt:
I’m looking for content that matches my mood, which is currently nostalgic, and I want to watch with my 17-year-old daughter.
The following list outlines the system's reasoning and execution steps:
1. Mood identification: The system interpreted the emotional tone of the query and categorized the user's current mood as nostalgic.
2. Audience analysis: The assistant recognized that the content must be suitable for both the user and a 17-year-old viewer, thus enforcing family-friendly constraints.
3. Genre mapping: The nostalgic mood was algorithmically associated with the coming-of-age sub-genre, which typically aligns with reflective and emotional themes.
4. Demographic compatibility: The assistant prioritized content with cross-generational appeal, targeting narratives resonant with both teenage and adult audiences.
5. User preference profiling: The engine referenced a preference database, examining profiles of users aged 16, 52, 19, 77, and 82 to filter for those favoring family and coming-of-age content.
6. Intersection analysis: A focused subset of users (IDs: 16, 52, 19) was identified whose preferences aligned with both genre and audience criteria.
7. Thematic enrichment: Common thematic preferences across the filtered user group were extracted, highlighting courage, love, and adventure as dominant narrative elements.
8. Content-type filtering: The system excluded live event content by applying a logical constraint (live_event_flag = False), in accordance with inferred viewing context.
The system formalized the query as the following structured retrieval specification:
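The exact serialization is not reproduced in the text; a plausible Python-style rendering, reconstructed from the reasoning steps above, would be:

retrieval_spec = {
    "genre": "Family",
    "sub_genre": "Coming-of-Age",
    "themes": ["Courage", "Love", "Adventure"],
    "live_event_flag": False,  # exclude live events, per step 8
}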
The following is the final output statement:
Retrieve non-live Family content with a focus on the Coming-of-Age sub-genre, incorporating themes of Courage, Love, and Adventure.
This system is a modular RAG pipeline designed to deliver personalized content recommendations by integrating structured profile data, hybrid retrieval mechanisms (BM25 + dense vectors), reranking via cross-encoder models, and LLM reasoning. The pipeline is built using LangChain, ChromaDB, Ollama, Transformers, and PyTorch, enabling dynamic retrieval, reasoning, and user-specific memory-based generation.
The following list outlines the modular components forming the backbone of the RAG assistant, each responsible for a specific stage in the retrieval and generation workflow, from data loading and vector indexing to hybrid retrieval, reranking, reasoning, and answer generation:
The output is a structured reasoning trace from a conversational recommender system that tailors film suggestions based on user preferences. Two users are involved; details as follows:
The following figure depicts how the assistant filters, retrieves, and presents recommendations accordingly. It suggests classics like The Goonies, E.T., and Back to the Future, as they align with the nostalgic, family-friendly, and age-appropriate preferences.
Figure 13.5: Output from the GenAI recommendation engine for two users
This system presents an advanced RAG pipeline that seamlessly integrates profile-based and content-based recommendation strategies, enhancing personalization through the use of hybrid retrieval and cross-encoder reranking for improved relevance. By incorporating conversational memory and ReAct-style prompting, it enables intelligent, context-aware responses tailored to user preferences. Additionally, the architecture is designed for extensibility, allowing future integration with multimodal inputs or real-time streaming data sources.
This chapter has outlined the foundational principles and practical implementations of reasoning within GenAI systems. By examining prompting techniques, we highlighted how structured reasoning can be elicited from language models to enhance decision-making. The discussion on reranking architectures demonstrated how reasoning can improve the selection of relevant outputs, while the exploration of recommendation stage reasoning illustrated the integration of diverse data sources to personalize content effectively. Together, these components form a cohesive framework for developing intelligent systems that go beyond surface-level retrieval, enabling context-aware, user-aligned responses. This understanding sets the stage for designing more robust and interpretable AI applications.
In the next chapter, we will understand other topics like text-to-SQL.
In the age of data-driven decision-making, the ability to interact with databases using natural language has emerged as a transformative capability. Text-to-Structured Query Language (SQL), a branch of natural language processing (NLP), enables users to translate plain English queries into structured SQL commands, allowing even non-technical users to access complex data insights with ease. This chapter explores the fundamental principles, system design, real-world applications, and practical implementation of text-to-SQL systems, particularly those powered by large language models (LLMs).
We begin by introducing the basic concepts underpinning text-to-SQL, including natural language understanding, schema linking, and SQL query generation. We will then examine the architectural foundations of modern text-to-SQL systems, highlighting the role of schema-aware prompting, LLMs, and tool-based orchestration. The chapter also discusses various applications, from business intelligence (BI) dashboards to voice-enabled analytics assistants.
Despite its promise, text-to-SQL poses unique challenges, including handling ambiguous queries, ensuring SQL validity, and aligning natural language with complex database schemas. We conclude with a practical implementation guide, outlining strategies for prompt design, schema integration, validation techniques, and evaluation metrics.
By the end of this chapter, readers will gain a comprehensive understanding of how natural language interfaces can revolutionize database accessibility and empower broader data literacy across organizations.
This chapter covers the following topics:
The primary objective of this chapter is to equip readers with a foundational understanding of text-to-SQL systems, enabling them to grasp how natural language inputs can be transformed into executable SQL queries using modern LLM-based techniques. By exploring core concepts, system architecture, practical applications, implementation strategies, and evaluation metrics, this chapter aims to provide both conceptual clarity and practical guidance. Readers will gain the necessary knowledge to design, evaluate, or extend text-to-SQL pipelines within their own domains, preparing them for more advanced, agent-based systems discussed in the next chapter. This sets the stage for intelligent, interactive data access workflows.
Despite the rapid advancements in generative AI (GenAI), particularly in NLP and code generation, the task of translating natural language into SQL, commonly referred to as text-to-SQL, remains one of the most challenging and nuanced problems in the field. While LLMs such as Generative Pre-trained Transformer (GPT) have significantly improved the fluency and contextual understanding of machines, they still struggle with the precise, structured, and domain-specific nature of SQL generation. This difficulty is compounded by a range of practical and theoretical challenges that make widespread deployment of text-to-SQL systems non-trivial, particularly in enterprise settings.
One of the fundamental challenges lies in the misalignment between natural language and structured data schemas. Human language is inherently ambiguous, context-rich, and often incomplete, whereas SQL requires exact, deterministic specifications that match the schema of a particular database. Users may refer to columns or tables in ways that do not directly align with the schema, using synonyms, abbreviations, or business-specific terminology, which requires the model not only to understand the intent but also to map it accurately to the database structure. This issue, known as schema linking, remains one of the core bottlenecks in building robust text-to-SQL systems.
Furthermore, not every organization’s data is ready for GenAI-based querying. Most enterprise databases are designed for performance and legacy compatibility, not for semantic accessibility. They may lack proper documentation, use inconsistent naming conventions, or contain deeply nested schemas that are hard to interpret even for experienced engineers. Without clean, well-structured, and richly annotated metadata, even the most powerful LLMs struggle to produce valid and contextually accurate SQL queries. This lack of GenAI readiness in corporate data environments severely limits the practical applicability of text-to-SQL systems in many organizations.
Another challenge is the lack of generalizability across domains. While LLMs fine-tuned on benchmark datasets like Spider or WikiSQL perform reasonably well in academic settings, their effectiveness drops significantly when applied to real-world databases that differ in schema design, data quality, or business logic. Domain-specific nuances often require customization of prompts, fine-tuning on proprietary data, and the inclusion of domain knowledge, which increases development complexity and reduces scalability.
Additionally, ensuring the correctness and safety of the generated SQL poses a significant risk. Incorrect or malformed SQL queries can lead to performance degradation, privacy violations, or even data corruption if write operations are involved. Validating the output of LLMs requires execution-time checks, permission constraints, and ideally a human-in-the-loop (HITL) system, all of which introduce latency and operational overhead.
In summary, while GenAI has brought unprecedented capabilities to natural language understanding and generation, the structured, context-specific, and high-stakes nature of SQL generation makes text-to-SQL an enduringly difficult problem. The challenges of schema alignment, data readiness, domain generalization, and execution safety must all be carefully addressed before text-to-SQL can achieve widespread, reliable adoption in enterprise environments.
Text-to-SQL refers to the task of translating natural language queries into SQL statements that can be executed on relational databases. The goal is to enable non-technical users to interact with databases without needing expertise in SQL syntax or a deep understanding of the underlying data schema. This transformation involves several key components, including natural language understanding, schema linking, semantic parsing, and SQL query generation.
Natural language understanding is the initial phase where the system interprets the user's intent conveyed through human language. For instance, if a user asks, "What are the total sales for each region in 2023?", the system must identify entities such as sales and region, and the temporal constraint 2023. This requires both syntactic analysis (e.g., parts-of-speech tagging, dependency parsing) and semantic interpretation (e.g., recognizing that total implies an aggregation function).
Schema linking is a fundamental aspect that involves aligning the natural language elements with the database schema components. In practice, this requires mapping phrases like total sales to a specific column in a sales table, and region to either a column in the same table or in a related regions table. Effective schema linking often involves synonym resolution, entity recognition, and disambiguation, which are non-trivial in heterogeneous or poorly documented databases. Schema linking can be categorized into explicit (direct matches between query and schema terms), implicit (requiring inference based on context), and fuzzy (handling vague or ambiguous references).
Semantic parsing is the process of converting the interpreted natural language into a structured logical form, such as an abstract syntax tree or logical query plan. This representation captures the semantics of the user’s request in a format that can be translated into SQL. Different parsing techniques include rule-based systems, statistical models, and neural approaches such as encoder-decoder architectures with attention mechanisms.
SQL generation involves mapping the logical form into an executable SQL query. This includes determining the appropriate SQL clauses (SELECT, FROM, WHERE, GROUP BY, etc.), resolving joins between related tables, applying aggregation functions, and ensuring correct filtering conditions. For example, the natural language question "Which products sold more than 1,000 units in January 2023?" would be translated into:
SELECT product_name FROM sales WHERE units_sold > 1000 AND sale_date BETWEEN '2023-01-01' AND '2023-01-31';
This transformation shows the need for precise mapping between human intent and machine-readable syntax.
Historically, text-to-SQL systems started as rule-based or template-driven methods that relied on handcrafted grammars and limited vocabularies. These systems lacked scalability and adaptability across domains. The introduction of machine learning (ML), especially deep learning, marked a shift toward more flexible, data-driven approaches. The use of sequence-to-sequence (Seq2Seq) models, attention mechanisms, and, more recently, LLMs such as GPT and Codex, has significantly advanced the state of the art.
Different types of SQL queries must also be considered. Simple queries involve SELECT or WHERE clauses, but more complex queries involve joins, aggregations, nested subqueries, window functions, and set operations like UNION or INTERSECT. Understanding these types is essential to cover a wide range of user intents.
Text-to-SQL systems can be categorized based on the level of supervision in their training: fully supervised systems require paired natural language and SQL examples; weakly supervised systems rely on indirect supervision (e.g., execution results); and unsupervised systems attempt to learn mappings without explicit training examples. Another useful classification is based on interaction style: single-shot queries versus multi-turn dialogue systems that support follow-up questions and clarification.
In terms of schema representation, systems must handle various complexities, including flat schemas (single table), hierarchical schemas (parent-child relationships), and relational graphs (multi-table databases with foreign keys). Representing the schema in a way that LLMs can understand, such as serialized table schemas, table-entity graphs, or embeddings, is crucial for accurate query generation.
In modern GenAI contexts, large pre-trained models have proven effective in understanding and generating SQL queries. However, they still depend heavily on prompt quality and schema-awareness. Techniques such as prompt engineering, retrieval-augmented generation (RAG), and tool-based augmentation (e.g., function-calling application programming interfaces (APIs)) are commonly used to improve accuracy and generalizability.
Use cases for text-to-SQL span across domains. In finance, users may query transaction volumes or average revenue. In healthcare, physicians might ask for patient data filtered by conditions or timeframes. In education, students can learn SQL by comparing natural language and formal query pairs. In public data access, citizens can ask natural language questions to extract insights from open government databases.
Despite advancements, common errors in text-to-SQL systems include semantic drift (where the generated SQL does not match the original intent), incorrect table or column references, and misinterpretation of filters or constraints. Mitigating these issues requires robust schema linking, strong language understanding, and dynamic validation mechanisms.
Understanding the basic concepts of text-to-SQL involves dissecting the multi-step process of parsing, linking, and translating human language into SQL. As GenAI models evolve, these systems are poised to become more accessible, adaptable, and accurate, but the foundational principles remain critical for successful implementation.
The practical value of text-to-SQL systems extends across a wide spectrum of industries, enabling more intuitive and efficient access to data through natural language interfaces. As organizations increasingly adopt data-driven decision-making processes, the need for non-technical stakeholders to interact directly with structured databases becomes critical. Text-to-SQL provides a mechanism for bridging this gap, fostering inclusivity, and democratizing access to insights. The following are the key domains where text-to-SQL is making a significant impact:
The real-world applications of text-to-SQL are vast and growing. From empowering internal stakeholders to improving public access to data, these systems are at the forefront of a shift toward more inclusive and intelligent data ecosystems. Their impact is particularly significant in organizations with large, heterogeneous datasets, where the friction of manual SQL scripting hinders decision-making. By embedding natural language interfaces into analytics workflows, organizations can unlock broader usage, deeper insights, and a faster path from question to answer.
While text-to-SQL systems offer a powerful interface between natural language and structured databases, the task remains fraught with substantial technical and practical challenges. These challenges arise from both the inherent ambiguity of natural language and the rigidity of SQL. Understanding these challenges is essential to designing robust, scalable, and enterprise-ready text-to-SQL systems.
The following list explores the most significant obstacles faced in this domain:
For instance, consider the question, "Show me the top-performing regions last quarter." The term top-performing could refer to revenue, profit margin, customer satisfaction, or some other metric. Similarly, last quarter must be resolved relative to the current date, requiring temporal context. In the absence of explicit clarification, even advanced LLMs may struggle to generate accurate SQL queries.
Another layer of complexity arises from pronouns and ellipsis in multi-turn dialogues. In a conversation where a user first asks, "List all products sold in Europe," and then follows up with "Which of them had declining sales?", the model must maintain contextual memory and resolve "them" to the correct entity set, a task that goes beyond syntactic translation and enters the realm of dialogue modeling and co-reference resolution.
Schema linking involves resolving expressions like the highest earning employee to something like employee.salary in the database. The complexity increases with the following:
This necessitates deep semantic understanding and often requires embedding the schema context into the prompt or model input in a way that supports accurate grounding.
Even LLMs such as GPT-4 can falter without schema conditioning or fine-tuning on domain-relevant queries. This limits the out-of-the-box utility of text-to-SQL solutions and necessitates domain adaptation techniques such as RAG, schema pre-embedding, and prompt engineering with domain-specific examples.
Beyond syntax, logical validity is another challenge. A query might run without error but return incorrect or misleading results. For example, an incorrectly placed GROUP BY clause or a missing HAVING filter can change the semantics of the query, resulting in analytics errors that may go unnoticed.
Additionally, query execution requires live database access, which complicates the training, debugging, and deployment of these systems. Offline validation environments or test sandboxes are often required, but they do not always replicate the production schema or data volume accurately.
Consider a dialogue like:
Each utterance depends on the context established by the previous ones. Maintaining the evolving query structure, filtering criteria, and target table references across turns is a significant architectural and modeling challenge. It requires memory-aware systems capable of maintaining and updating query state or constructing semantic graphs of the conversation.
Moreover, building feedback loops from user corrections, errors, or approval signals remains an open research area. Incorporating reinforcement learning from human feedback (RLHF), confidence scoring, and fallback mechanisms can help improve reliability but introduce further design complexity.
Failure to do so could result in data leaks, audit failures, or regulatory violations. This adds another layer of responsibility to system design, beyond just model accuracy.
Intent disambiguation strategies include the following:
These strategies must balance user experience (UX) (keeping interactions efficient) with interpretability and correctness.
While text-to-SQL represents a promising interface between human language and structured databases, its implementation in real-world settings is constrained by a host of technical, linguistic, and organizational challenges. From resolving natural language ambiguity to ensuring SQL safety and execution correctness, the path from user question to executable query is fraught with potential failure points. Addressing these challenges requires advances in model design, schema representation, domain adaptation, and user-centered interaction design. Only through a holistic approach that blends AI, data engineering, and UX considerations can robust and trustworthy text-to-SQL systems be developed.
Implementing a robust text-to-SQL system using modern language models requires careful orchestration of multiple components, ranging from prompt design and schema integration to output validation and system monitoring. While LLMs like GPT-4 have dramatically improved the feasibility of natural language interfaces for databases, their raw outputs must be carefully controlled, conditioned, and evaluated to ensure both correctness and safety in real-world settings. This section provides a comprehensive, step-by-step guide for implementing such systems, with an emphasis on pragmatic strategies grounded in current industry practices.
The following figure illustrates a high-level architecture of a modern text-to-SQL system, highlighting the critical stages from language model prompting to SQL validation and observability. It captures key components such as schema integration, user clarification, fallback mechanisms, multimodal extensions, and feedback loops essential for building robust and reliable natural language to SQL interfaces.
The following is an explanation of Figure 14.1:
1. Prompting strategies and language model conditioning: Prompt engineering is a critical part of text-to-SQL implementation. Since LLMs operate within a zero-shot or few-shot paradigm, carefully constructed prompts can significantly influence their ability to translate natural language to correct SQL.
• Zero-shot prompting: This approach assumes the model has been pre-trained on SQL patterns. A basic prompt might simply present the user query and database schema, followed by the instruction: Generate the corresponding SQL query.
Example:
o Input: List customers who placed more than 5 orders last month.
o Schema: Customers (id, name), Orders (id, customer_id, order_date)
The model must infer the correct join and time filter from context alone.
• Few-shot prompting: Few-shot prompting includes 1-5 manually curated examples in the prompt to illustrate mappings between questions and SQL. This method improves accuracy, especially for complex queries, and allows injection of domain-specific idioms or business rules.
• Chain of thought (CoT) prompting: For very complex queries, one may use intermediate reasoning steps in the prompt. For instance: first identify relevant tables, then define filters, then compose joins.
CoT also enables a modular or agentic decomposition approach, particularly useful in enterprise settings with complex schemas.
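To make these prompting strategies concrete, the following is a minimal sketch of how zero-shot and few-shot prompts might be assembled in Python. The schema string, the example pair, and the instruction wording are illustrative assumptions rather than a prescribed format:

schema = "Customers(id, name), Orders(id, customer_id, order_date)"
question = "List customers who placed more than 5 orders last month."

# Zero-shot: schema plus question, followed by a single instruction.
zero_shot_prompt = (
    f"Database schema:\n{schema}\n\n"
    f"User question: {question}\n\n"
    "Generate the corresponding SQL query."
)

# Few-shot: prepend curated question-SQL pairs so the model can
# imitate the mapping before answering the real question.
examples = (
    "Q: How many customers are there?\n"
    "SQL: SELECT COUNT(*) FROM Customers;\n"
)
few_shot_prompt = f"{schema}\n\n{examples}\nQ: {question}\nSQL:"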
2. Schema and metadata integration: LLMs do not inherently know the schema of a specific database, or its metadata, unless it is explicitly provided. To bridge this gap, the schema and metadata must be embedded into the prompt or passed as context.
a. Flat schema listing: Tables and columns are simply listed before the prompt. This is effective for small or moderately sized databases.
b. Structured schema encoding: For larger schemas, especially with multiple foreign keys and nested joins, structured representation formats like JSON, annotated schema graphs, or entity-relationship summaries can be more effective.
c. Semantic schema mapping: Advanced implementations use embedding-based semantic matching to relate user query terms with schema labels, identifying synonyms, acronyms, and implicit references. For instance, mapping staff to employee or revenue to sales_amount.
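Both ideas can be sketched briefly. The JSON encoding and the embedding-based matcher below are illustrative; the sentence-transformers library and the model name are assumptions, not requirements of the approach:

import json
from sentence_transformers import SentenceTransformer, util

# Structured schema encoding: serialize tables, keys, and relationships
# so the model can reason about joins explicitly.
schema = {
    "Customers": {"columns": ["id", "name"], "primary_key": "id"},
    "Orders": {
        "columns": ["id", "customer_id", "order_date"],
        "foreign_keys": {"customer_id": "Customers.id"},
    },
}
schema_context = json.dumps(schema, indent=2)  # embedded into the prompt

# Semantic schema mapping: resolve a user term to its closest column
# by cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
columns = ["employee_name", "salary", "sales_amount", "hire_date"]
col_vecs = model.encode(columns)

def link_term(term: str) -> str:
    sims = util.cos_sim(model.encode([term]), col_vecs)[0]
    return columns[int(sims.argmax())]

print(link_term("revenue"))  # expected to resolve to sales_amount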
3. Intermediate planning and decomposition: Some implementations benefit from intermediate planning stages. Instead of generating SQL directly, the system might:
a. First generate a natural language plan (e.g., we need to join Orders and Customers, filter by order_date, group by customer_id).
b. Then transform that plan into SQL.
c. This decomposition allows for validation at each stage and makes debugging easier.
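A minimal sketch of this plan-then-SQL decomposition follows; ask_llm is a hypothetical helper standing in for whatever model call the system uses:

def text_to_sql_with_plan(question: str, schema: str, ask_llm) -> str:
    # Stage 1: produce a natural language plan that can be inspected or
    # validated (e.g., do the named tables exist in the schema?).
    plan = ask_llm(
        f"Schema: {schema}\nQuestion: {question}\n"
        "Describe step by step which tables to join and how to filter."
    )
    # Stage 2: turn the validated plan into SQL.
    return ask_llm(f"Schema: {schema}\nPlan: {plan}\nWrite only the SQL query.")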
4. Row and table summarization: Row summarization refers to generating a textual description of a specific row or record in a table. This often involves identifying key values, relationships, or anomalies and expressing them in fluent natural language. Table summarization focuses on producing concise narratives or insights about an entire dataset, such as trends, aggregates, distributions, or outliers across multiple rows and columns.
a. Row summarization workflow:
The system first interprets the user’s natural language prompt and identifies the target row (via SQL filtering or lookup). Then, a summarization module, either rule-based or powered by LLMs, generates a narrative using selected fields and values.
i. Input prompt: Summarize the top-selling product in April.
ii. Row output: {Product: 'Smartwatch X', Sales: 15,300, Region: 'North America'}.
iii. Summary: Smartwatch X was the best-selling product in April, with 15,300 units sold in North America.
b. Table summarization workflow:
After executing a SQL query that returns multiple rows, the system identifies key metrics (averages, trends, modes, and anomalies) and generates a summary.
i. Input prompt: Give me a summary of quarterly sales.
ii. Summary: Sales increased steadily over the quarters, peaking in Q4 with $3.2M in revenue. The North region consistently outperformed other regions.
5. SQL output validation and safety: SQL generated by LLMs can be syntactically or logically incorrect. It is important to validate generated queries before execution.
• Static analysis: Apply a SQL parser to check for syntax correctness. Tools like sqlparse or dialect-specific validators can catch basic errors.
• Schema-aware validation: Cross-check whether referenced tables and columns exist in the target database schema.
• Logical validation: Some systems implement test queries on a sample database or restrict execution to read-only views to prevent unintended side effects.
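One way to combine these checks is sketched below, using the sqlparse library for static analysis and SQLite's read-only URI mode as a simple execution sandbox. The database path and the single-SELECT policy are illustrative assumptions:

import sqlite3
import sqlparse

def is_safe_select(sql: str) -> bool:
    # Static analysis: the input must parse as exactly one SELECT statement.
    statements = sqlparse.parse(sql)
    return len(statements) == 1 and statements[0].get_type() == "SELECT"

def run_readonly(sql: str, db_path: str = "analytics.db"):
    if not is_safe_select(sql):
        raise ValueError("Only single SELECT statements are allowed.")
    # mode=ro opens the database read-only, so even a malformed or
    # malicious query cannot mutate data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()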
6. Multimodal and tool-augmented extensions: Recent systems explore hybrid architectures where the LLM interfaces with external tools or databases through function-calling APIs or plugins. For instance, an LLM might call a get_table_info tool to dynamically retrieve schema metadata or use a vector search module to resolve ambiguous column references. These tool-augmented LLMs blur the line between static language models and interactive agents.
Moreover, multimodal extensions may incorporate tables, charts, or visual dashboards as output formats. While still emerging, architectures that combine text input with visual output (text-to-SQL-to-visualization) are gaining traction in BI settings.
7. System integration considerations: Architectural decisions must also consider latency, scalability, and deployment environment. Some systems are cloud-based with real-time API calls to models like OpenAI’s Codex or Anthropic’s Claude. Others run locally with open-source models like LLM Meta AI (Llama) or Falcon, offering better control and privacy. Caching frequently used query results and modularizing the system components ensures both performance and maintainability.
8. User interaction and query clarification: Since many queries are ambiguous, the system should support clarification prompts. If multiple SQL interpretations are possible, present choices to the user:
a. For example: "Did you mean top customers by revenue or by number of orders?" This prevents wrong assumptions and builds trust.
9. Governance and compliance controls: In enterprise settings, ensure the system:
a. Redacts or masks sensitive fields in generated SQL.
b. Enforces row-level access restrictions.
c. Validates user credentials and access scope.
Integrating with existing identity and access management (IAM) systems ensures responsible deployment.
10. Fallback and retry mechanisms in text-to-SQL systems: As text-to-SQL systems evolve to support natural language access to structured databases, they must deal with a range of ambiguities, errors, and unpredictable user inputs. To ensure reliability and resilience, modern systems implement fallback and retry strategies, essential components for maintaining usability, trust, and accuracy. These mechanisms deal with the following:
a. Ambiguous queries (e.g., "How many leads closed last quarter?" where closed is not clearly defined)
b. Schema mismatches (e.g., using revenue when the table column is total_sales)
c. Model hallucination (e.g., referencing non-existent tables or columns)
d. Execution errors (e.g., SQL syntax errors, timeouts, or permission issues)
Fallback and retry strategies in text-to-SQL systems can take several forms, each designed to improve reliability and UX when the initial query fails. One approach is natural language clarification, where the system detects ambiguity or missing context and responds with a follow-up question, for example, "Did you mean total revenue or net profit?" This encourages a conversational loop that helps disambiguate intent. Another method is retry with prompt refinement, where the system automatically adjusts the prompt using more specific templates, schema hints, or fine-tuned few-shot examples to regenerate valid SQL, typically without exposing this process to the user. In cases where the input is too complex, the system may fall back to a simplified query, reformulating the request into a more basic version that still yields useful insights; for instance, turning a complex query about revenue growth for top SKUs into a simpler "show average revenue by SKU." When the system cannot interpret the prompt at all, it may fall back to search or documentation, redirecting the user to dashboards, saved queries, or relevant schema references. Another strategy involves using default query templates for common requests, such as "top 10 customers" or "monthly trend," especially when the intent is clear but exact mapping fails.
A robust retry engine might follow this logic:
# generate_sql, execute_sql, and the recovery helpers are hypothetical
# stand-ins for the system's own functions.
def run_with_retries(natural_query):
    try:
        sql = generate_sql(natural_query)
        result = execute_sql(sql)
    except SQLValidationError:
        sql = regenerate_with_schema_guidance(natural_query)
        result = execute_sql(sql)
    except TableOrColumnNotFound:
        sql = retry_with_synonym_mapping(natural_query)
        result = execute_sql(sql)
    except Exception:
        return "Sorry, I couldn’t find what you’re asking for. Could you rephrase?"
    return result
11. Deployment of models: Implementation can follow different deployment strategies, which are as follows:
a. Embedded LLM API: Using external APIs like OpenAI’s GPT or Azure OpenAI with schema-aware prompts.
b. Self-hosted model: Fine-tuned smaller models (e.g., SQLCoder) deployed on local servers.
c. Hybrid agentic systems: LangChain-based orchestration with separate tools for parsing, validation, and reranking.
Choice of deployment depends on latency requirements, data security policies, and cost considerations.
Implementing a text-to-SQL system using LLMs requires far more than calling an API with a user prompt. It involves thoughtful integration of schema context, careful prompt construction, robust query validation, and user interaction design. When implemented well, such systems can transform how users engage with data—making structured databases accessible to non-technical users and accelerating insight generation across domains. However, reliability and safety must remain core priorities in any practical deployment.
12. Observability: As text-to-SQL systems grow in complexity, observability becomes critical for ensuring reliability, transparency, and continuous improvement. Observability refers to the ability to monitor internal states through externally measurable outputs such as query logs, model confidence scores, failure patterns, and latency metrics. Academic and production-grade systems alike benefit from instrumenting each stage of the text-to-SQL pipeline—from natural language parsing to SQL generation and execution—with detailed telemetry. This facilitates error diagnosis, user behavior analysis, prompt optimization, and safe rollback during model updates, ultimately supporting system accountability and responsible AI practices.
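As a minimal illustration, the generation call can be wrapped with structured telemetry; generate_sql here is a hypothetical stand-in for the system's own model call:

import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def generate_sql_with_telemetry(question: str, generate_sql):
    # Capture latency, output, and failure mode as one structured log event.
    start = time.perf_counter()
    sql, error = None, None
    try:
        sql = generate_sql(question)
    except Exception as exc:
        error = str(exc)
    logging.info(json.dumps({
        "event": "text_to_sql",
        "question": question,
        "sql": sql,
        "error": error,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return sql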
13. Feedback loop and correction interface: Finally, in enterprise-grade deployments, HITL escalation ensures that unresolved queries are routed to data analysts, who can provide responses and contribute to improving the system through feedback and training data. These layered fallback strategies make text-to-SQL systems more resilient, adaptive, and user-friendly. Users should be able to flag incorrect outputs and suggest corrections. Capturing this data enables retraining, fine-tuning, or rule updates.
a. Correction interface: Allow users to edit generated SQL or select from ranked alternatives. Use this input to adjust future prompt templates or schema mappings.
b. Logging and analytics: Track model confidence, failure reasons, and common query patterns. Over time, this supports system refinement and identifies training gaps.
Note: Execution environment setup: A reliable execution environment is essential for safe query evaluation.
Building on the step-by-step guide for implementing a text-to-SQL system, which typically includes components such as schema ingestion, prompt engineering, SQL decoding, and query validation, the next logical focus is on entity extraction. Entity extraction acts as the semantic bridge between unstructured user queries and structured database elements, enabling the system to ground natural language input in the schema vocabulary. Whether as a standalone module or as part of an agent-based orchestration workflow, robust entity extraction enhances interpretability, modularity, and SQL accuracy, laying the foundation for more reliable downstream query generation.
This pipeline implements an end-to-end NLP workflow that transforms unstructured product review text into structured tabular data. It achieves this by extracting named entities such as customer names and purchase dates using a local LLM, combining them with existing tabular records, and storing the merged results in an in-memory SQL database for subsequent querying.
The system is modular and agentic in design, leveraging the LangGraph library to define stateful graph transitions and Ollama for LLM inference. This pattern supports scalable, interpretable workflows for enterprise data integration and question answering (QA).
The following figure visually represents the LangGraph-based workflow for a text-to-SQL preprocessing pipeline. It captures the conditional execution logic between multiple agent nodes responsible for parsing column semantics, generating LLM-based chains, extracting structured data, merging datasets, and populating an SQL-accessible database. The flow begins with a conditional entry point based on the availability of column descriptions and proceeds through a deterministic path of entity extraction and data consolidation. A retry branch ensures robustness, while the final decision node allows the system to either populate the database or terminate gracefully. This modular design enables interpretable, state-aware orchestration of NLP tasks.
Figure 14.2: LangGraph workflow for entity extraction using text-to-SQL
The complete end-to-end code is provided in the GitHub repository, where you can find and understand various architectural approaches to experiment with.
The implementation is architected as a multi-agent system using the LangGraph framework. The workflow is composed of five main agents, each encapsulating a discrete function, which are as follows:
A graph-based control flow governs the transitions between these agents, supporting conditional branching, retry logic, and full control over execution sequencing.
The following steps outline a detailed walkthrough of the code:
1. Imports and environment setup: To begin, all necessary libraries are imported, including LangGraph for workflow orchestration, Ollama for local model interaction, and pandas/SQLAlchemy for data processing and storage, as shown in the following code:
import ollama
import langgraph
from langchain_ollama import ChatOllama
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool
The environment combines Ollama for local LLM inference and LangGraph for declarative workflow definition. The ChatOllama wrapper interfaces with the model llama3.2:3b-instruct-fp16, which serves as the named entity recognition (NER) engine.
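A typical instantiation might look as follows; setting temperature to zero is an assumption chosen to keep the extracted JSON deterministic:

from langchain_ollama import ChatOllama

# Deterministic decoding reduces variance in the JSON the model emits.
llm = ChatOllama(model="llama3.2:3b-instruct-fp16", temperature=0)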
2. Defining the shared graph state: The pipeline uses a mutable, typed graph state to pass structured data and artifacts between agents. This centralized state design supports modular, state-aware transitions, as shown in the following code:
class GraphState(TypedDict):
    question: str
    ...
The GraphState type defines the shape of the shared mutable state. It includes metadata (like the user question), input schema, chain objects, and intermediate outputs. This design adheres to functional programming principles while enabling state mutation across agent transitions.
3. ColumnNameAgent: This agent parses the user-supplied column descriptions into a structured mapping that guides downstream extraction. The following is its core:
class ColumnNameAgent:
    def run(self, state):
        ...
This agent parses the user-defined column_name_str into a structured dictionary column_names. Each entry maps a raw column label to a semantic description (e.g., "Name": "<Name of the customer>"). These tags guide the LLM in downstream extraction.
4. ChainCreationAgent: This agent constructs a LangChain pipeline, connecting a prompt, a local LLM, and a JSON parser. The following is the core of the entity recognition process:
class ChainCreationAgent:
    def run(self, state):
        ...
a. The LLM is configured with a PromptTemplate instructing it to perform named entity recognition. The prompt is crafted in a role-specific tone and demands structured output:
template = """您需要扮演命名实体识别者的角色。
template = """You need to act as a Named Entity Recognizer.
b. 从评论文本中提取以下列名称:
b. Extract the following column names from the review text:
{列名}
{column_names}
...
...
必须严格按照 JSON 格式响应,例如:{"column_1": "<value 1>", ... }
STRICTLY respond in JSON format like: {"column_1": "<value 1>", ... }
"""
"""
This agent initializes a chain, linking the prompt | LLM | JSON parser.
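The composition itself can be sketched as follows, reusing the llm object configured earlier. One practical caveat: literal braces in the expected JSON must be escaped as double braces so that PromptTemplate treats only the real placeholders as variables:

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "You need to act as a Named Entity Recognizer.\n"
    "Extract the following column names from the review text:\n"
    "{column_names}\n"
    "Review: {review_text}\n"
    'STRICTLY respond in JSON format like: {{"column_1": "<value 1>"}}'
)
chain = prompt | llm | JsonOutputParser()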
5. EntityExtractionAgent: The model-powered chain is now invoked over a list of review texts, generating structured row-wise dictionaries of extracted values for downstream processing, as shown in the following code:
class EntityExtractionAgent:
    def run(self, state):
        ...
The chain is applied iteratively over the ReviewText column in df2. Each LLM output is parsed and collected into extracted_data, a list of dictionaries representing structured row-level extractions. This agent essentially operationalizes the LLM as an entity extractor.
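The iteration reduces to a simple loop; df2, column_names, and chain are the objects established in the preceding steps:

import pandas as pd

extracted_data = []
for review in df2["ReviewText"]:
    # Each invocation returns a dict of extracted field values for one row.
    extracted_data.append(
        chain.invoke({"column_names": column_names, "review_text": review})
    )
extracted_df = pd.DataFrame(extracted_data)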
6. DataCombinationAgent: The extracted fields are merged with the existing tabular data, aligning on key columns. The result is a fully structured dataset with both original and derived information, as shown in the following code:
class DataCombinationAgent:
    def run(self, state):
        ...
a. This stage performs a join between:
i. The original structured table df1
ii. The newly extracted dataframe extracted_df
Join keys are inferred from the structured column definitions. The result is saved as merged_data and exported to disk as a CSV file.
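In pandas terms the join reduces to a single merge; the key column Name is an illustrative assumption:

# Left-join the extracted entities onto the original structured table.
merged_data = df1.merge(extracted_df, on="Name", how="left")
merged_data.to_csv("merged_output.csv", index=False)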
7. DatabaseAgent: After combining the datasets, this agent writes the output to a memory-resident SQLite database, making it accessible via SQL queries:
class DatabaseAgent:
    def run(self, state):
        ...
Here, the merged data is persisted in a transient SQLite database. SQLAlchemy is configured with a StaticPool to ensure the in-memory connection remains valid across sessions. This enables downstream LLMs or applications to perform SQL queries without requiring a full relational database management system (RDBMS).
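The engine setup behind this behavior might look as follows; the table name customer_data is illustrative:

from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

# StaticPool pins a single in-memory connection, so data written here
# remains visible to every later query against the same engine.
engine = create_engine(
    "sqlite://",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)
merged_data.to_sql("customer_data", engine, index=False, if_exists="replace")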
8. Graph definition and workflow compilation: Each agent is added to the LangGraph as a node. Conditional edges define execution paths based on data availability and retry logic, as shown in the following code:
workflow = StateGraph(GraphState)
...
graph = workflow.compile()
Each agent is added as a node in the LangGraph, with conditional transitions between them. Entry point routing and retry logic are controlled by custom functions decide_entry_point and decide_next_step.
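A condensed sketch of the wiring follows; the node names are assumptions, while decide_entry_point and decide_next_step are the routing functions mentioned above, each returning the name of the next node (or END):

from langgraph.graph import END, StateGraph

workflow = StateGraph(GraphState)
workflow.add_node("column_names", ColumnNameAgent().run)
workflow.add_node("create_chain", ChainCreationAgent().run)
workflow.add_node("extract", EntityExtractionAgent().run)
workflow.add_node("combine", DataCombinationAgent().run)
workflow.add_node("database", DatabaseAgent().run)

workflow.set_conditional_entry_point(decide_entry_point)
workflow.add_edge("column_names", "create_chain")
workflow.add_edge("create_chain", "extract")
workflow.add_conditional_edges("extract", decide_next_step)  # retry or proceed
workflow.add_edge("combine", "database")
workflow.add_edge("database", END)

graph = workflow.compile()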
9. Workflow execution: A custom runner simulates step-by-step execution of the graph. This loop handles node routing, transitions, and error resolution, as shown:
def process_workflow(state):
    ...
The process_workflow function executes the pipeline sequentially. This is a linearized version of LangGraph's graph traversal. It manually steps through each phase until the end is reached, logging output at each transition.
10. Initialization and example run: Finally, sample data is used to initialize the state, and the full workflow is executed. Outputs include the final structured dataset and a live database engine for SQL access, as shown in the following code:
initial_state = GraphState(...)
a. Two toy DataFrames (df1 and df2) simulate a customer dataset and corresponding review texts. Upon execution, the final state includes the following:
i. Extracted entities
ii. Merged dataset
iii. Database engine
This setup facilitates downstream querying, visualization, or LLM-assisted analytics over the structured result.
This workflow demonstrates a composable, interpretable architecture for turning natural language data into SQL-ready form using local LLMs and graph-based orchestration. The modular agent design enhances explainability and error isolation, while LangGraph enables flexible control of workflow logic and retries. Such systems are valuable in customer support automation, e-commerce analytics, and review summarization pipelines.
An end-to-end Multi DB Agentic implementation is available in Chapter 15, Agentic Text-to-SQL Systems and Architecture Decision-Making.
In the modern data-centric economy, access to actionable information is critical for decision-making, innovation, and operational efficiency. Yet the vast majority of valuable data resides in structured relational databases that are often inaccessible to non-technical users. These users typically lack the expertise to write SQL queries, understand schema complexity, or navigate BI tools with steep learning curves. Text-to-SQL systems, which enable users to interact with structured data using natural language, are poised to transform this landscape by dramatically increasing data accessibility and promoting data literacy across organizational hierarchies.
The following list explores how text-to-SQL systems are transforming the landscape of data accessibility, enabling a more inclusive, agile, and data-literate workforce:
Text-to-SQL systems remove this barrier by allowing users to express their queries in plain language. For example, instead of waiting for a data analyst to write a SQL query, a marketing manager could type, "Show me all leads from last month who converted into customers." This input is translated automatically into SQL, executed, and visualized instantly. The result is a faster feedback loop, empowering business users to make data-informed decisions independently.
By embedding domain-specific prompts and leveraging schema-aware prompting, these systems can accommodate multiple departments without requiring them to understand the underlying database structure. This democratization promotes transparency, cross-functional collaboration, and a shared understanding of key metrics.
Text-to-SQL lowers the entry point for engaging with data. By allowing users to formulate and iterate on data questions in natural language, it builds an intuitive understanding of how data is structured and how it can answer business questions. Over time, users begin to develop mental models of the schema, understand joins and filters, and even improve their question formulation skills.
Additionally, some educational platforms use text-to-SQL as a teaching tool. Learners can input questions in English and see how they map to SQL syntax. This interactive learning process supports comprehension and builds confidence in data exploration.
These questions are transformed into executable queries and delivered instantly, reducing friction and enabling real-time decisions.
This increases trust in data, reduces the risk of misinterpretation, and supports compliance with internal policies or regulatory standards.
This cultivates a more exploratory, insight-driven mindset across the organization, moving from static dashboards to dynamic querying.
This positions text-to-SQL not only as a technical innovation but also as a key enabler of digital inclusion.
Text-to-SQL technology is more than a technical convenience; it is a strategic enabler of widespread data empowerment. By allowing users to ask questions in their own words and receive reliable answers grounded in structured data, these systems break down longstanding barriers between data and people. They enable self-service analytics, foster a culture of curiosity, and elevate the data literacy of an organization as a whole. In the coming years, the successful adoption of text-to-SQL systems may be a key differentiator for organizations that seek to be truly data-driven in both strategy and execution.
Evaluating the performance of text-to-SQL systems is a complex and multifaceted task. These systems do not produce simple labels or continuous values; instead, they generate structured queries that must be both syntactically correct and semantically aligned with the user's intent. Moreover, there is often more than one correct way to express a query in SQL, which complicates evaluation further. This section introduces key evaluation metrics used in text-to-SQL research and practice, providing detailed definitions and guidance on their use, followed by best practices for real-world deployments.
Exact match accuracy measures the percentage of generated SQL queries that match the reference (ground truth) queries exactly, including all elements such as clauses, table names, aliases, and formatting. This is a strict metric where even a minor variation (such as a different join order or use of an alias) is considered an error. It is typically applied in benchmark datasets like Spider, where ground truth is available and the task is framed as one-to-one mapping from natural language to SQL.
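A minimal implementation, assuming the common practice of normalizing whitespace and case before comparison:

def exact_match(predicted: str, reference: str) -> bool:
    # Collapse whitespace and case; any other difference counts as a miss.
    def normalize(query: str) -> str:
        return " ".join(query.lower().split())
    return normalize(predicted) == normalize(reference)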
The following list outlines the advantages, limitations, and use cases:
Execution accuracy evaluates whether the generated SQL, when executed on the target database, returns the same result as the reference SQL query. It directly compares the output of both queries and considers them equal if the result sets match, regardless of query structure.
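The comparison can be sketched with any DB-API connection; sorting the rows makes the check order-insensitive, which is a simplifying assumption when queries carry no ORDER BY:

def execution_accuracy(pred_sql: str, ref_sql: str, conn) -> bool:
    # Two queries count as equivalent if their result multisets match.
    predicted = sorted(conn.execute(pred_sql).fetchall())
    reference = sorted(conn.execute(ref_sql).fetchall())
    return predicted == reference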
The following list outlines the advantages, limitations, and use cases:
This metric decomposes the SQL query into structural components such as SELECT, WHERE, GROUP BY, ORDER BY, HAVING, and JOIN clauses. It measures how many of these components are correctly predicted relative to the reference query.
The following list outlines the advantages, limitations, and use cases:
This metric measures the percentage of generated SQL queries that can be executed successfully on the database without triggering syntax or runtime errors. It does not assess the correctness of results, only whether the query can run.
The following list outlines the advantages, limitations, and use cases:
Semantic equivalence testing aims to determine whether two SQL queries are functionally identical despite differing syntactic forms. This often involves normalizing or canonicalizing the queries (e.g., removing aliases, reordering joins) before comparing them.
The following list outlines the advantages, limitations, and use cases:
Human evaluation involves expert reviewers assessing the quality of the generated SQL based on criteria such as correctness, clarity, relevance, and efficiency. Reviewers may manually execute queries or inspect their logic against the schema.
The following list outlines the advantages, limitations, and use cases:
These operational metrics measure the time required to generate a SQL query (latency) and the number of queries that can be processed in a given time period (throughput). They indicate system responsiveness and scalability.
The following list outlines the advantages, limitations, and use cases:
Evaluating text-to-SQL systems requires more than checking for correct outputs—it demands a structured approach that balances quantitative metrics, representative datasets, and continuous improvement. The following practices help ensure meaningful, reliable performance assessment:
This ensures robust generalization.
Analyzing these patterns helps in prompt refinement and model tuning.
Measuring the performance of text-to-SQL systems demands more than accuracy alone. A holistic evaluation framework must incorporate syntactic, semantic, and operational metrics. From exact matches and execution correctness to latency and human feedback, these metrics provide the foundation for model improvement, deployment readiness, and user trust. As these systems continue to evolve, standardizing evaluation methodologies will be essential for benchmarking progress, ensuring fairness, and guiding practical adoption at scale.
This chapter has provided a comprehensive introduction to the foundational components of text-to-SQL systems, bridging the gap between natural language queries and structured data access. We began by exploring the basic concepts underlying text-to-SQL, including schema linking, SQL generation, and the role of LLMs in interpreting ambiguous user intent. We then examined system architecture patterns, ranging from simple prompting strategies to agent-based execution graphs. Real-world applications demonstrated how text-to-SQL can empower users across domains such as BI, healthcare, education, and finance. We also analyzed the technical challenges associated with schema alignment, validation, and deployment, followed by best practices for implementation and performance evaluation. Together, these insights offer a blueprint for designing reliable, scalable, and user-centric text-to-SQL solutions.
Text-to-SQL is not merely a technical innovation; it represents a fundamental shift in how individuals interact with data. By lowering the barrier to querying relational databases, it promotes organizational data literacy and accelerates decision-making across roles and functions.
The next chapter will introduce an advanced, agentic multi-query text-to-SQL system. We will explore how LLM-powered agents can collaborate to handle multi-turn dialogues, join reasoning, and query decomposition, enabling robust and explainable data retrieval in complex, real-world environments.
In this chapter, we start from where we left off in the last chapter. Agentic text-to-Structured Query Language (SQL) systems represent a significant evolution in how humans interact with structured data. Rather than relying on static rules or pre-defined templates, these systems use autonomous agents, powered by large language models (LLMs), retrieval mechanisms, and reasoning frameworks like LangChain, to dynamically translate natural language questions into executable SQL queries. This chapter explores the architecture and decision-making strategies required to design such intelligent systems.
At the core of these architectures lies a multi-step orchestration process involving query embedding, semantic search, schema matching, SQL generation, and federated execution. Each component, from Sentence Transformer-based embeddings to LangChain's ReAct agent with chain-of-thought (CoT) prompting, plays a crucial role in maintaining accuracy, adaptability, and transparency. The use of global indexes, schema matchers, and pre-filtering ensures that the agent can handle cross-database queries with minimal hallucination or ambiguity.
This chapter breaks down the full pipeline shown in the architectural diagram and explains key design choices, including when to use federated query engines (FQEs), how to implement index-aware retrieval, and how to score and rerank SQL outputs using LLM-based evaluators. By the end, readers will gain a structured blueprint for implementing scalable, reliable, and interpretable agentic text-to-SQL systems tailored to real-world enterprise needs.
In this chapter, we will learn about the following topics:
The objective of this chapter is to present a modular and scalable framework for building agentic text-to-SQL systems that enable natural language querying across distributed, structured databases. By combining LLMs with planning agents, schema-aware tool use, and semantic indexing, the system intelligently translates user queries into executable SQL. This chapter outlines the architecture, implementation, and design trade-offs involved in developing such systems using LangChain's ReAct agent, global index lookup, and federated SQL execution. The goal is to empower practitioners to build robust, context-aware SQL agents capable of adaptive reasoning and accurate multi-database query execution.
Modern retail businesses generate massive data across distributed databases: customer profiles in PostgreSQL, orders in MySQL, marketing logs in MongoDB, and inventory in separate systems. These data silos hinder fast, data-driven decisions, especially for non-technical users who struggle to query structured databases using SQL. Manual query writing by data teams leads to bottlenecks, delays, and lost agility.
The retail enterprise seeks to enable real-time, natural language access to its sales, customer, and inventory data across multiple heterogeneous databases.
The following list outlines the current pain points:
So, our goal is to develop an agentic text-to-SQL system with schema-aware tool use, semantic indexing, and federated execution to enable intelligent, self-serve querying for business users, without needing SQL expertise.
Figure 15.1 presents a high-level view of the complete text-to-SQL pipeline, illustrating the chronological flow of data and control from user query to SQL execution and final response.
Figure 15.1: A very high-level workflow of the complete text-to-SQL pipeline
The architecture in Figure 15.2 illustrates an end-to-end agentic text-to-SQL system designed to bridge natural language queries with structured, multi-database environments. The pipeline showcases how user input is transformed into SQL through a series of intelligent steps: embedding generation, schema matching, semantic retrieval, and CoT-based SQL synthesis. At its core, the system leverages LangChain's ReAct agent framework, integrating pre-filtering, LLM reasoning, SQL grading, and optional federated execution. This architecture enables real-time, schema-aware querying across siloed datasets, empowering business users to retrieve actionable insights without writing SQL, making enterprise analytics faster, more accessible, and highly contextual.
Figure 15.2: Solution design of an agentic text-to-SQL solution
Figure 15.2 presents a detailed view of the complete text-to-SQL pipeline, illustrating the chronological flow of data and control from user query to SQL execution and final response. The steps below explain how each stage in the figure is labeled and corresponds to a distinct system function:
1. User input: The process begins when the user submits a natural language query (e.g., via Streamlit interface or application programming interface (API)). The query is passed into the backend pipeline for processing.
2. Query embedding generation: The input query is transformed into a vector representation using a pre-trained Sentence Transformer model. This embedding captures the semantic meaning of the query.
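A minimal sketch of this step with the Sentence Transformers library is shown below; the model name is illustrative, as the repository may pin a different one:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
query_embedding = model.encode("Which customers in Pune ordered in May?")
print(query_embedding.shape)  # a fixed-size dense vector, e.g. (384,)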
3. Global index: The query embedding is forwarded to the global index (e.g., implemented in ChromaDB) for similarity-based retrieval of relevant schema summaries or historical query patterns.
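Conceptually, the lookup amounts to a nearest-neighbor query against a ChromaDB collection. The sketch below uses an in-memory client and a made-up schema summary, whereas the project persists its index under global_index_db/:

import chromadb

client = chromadb.Client()  # in-memory, for illustration only
index = client.get_or_create_collection("global_index")
index.add(ids=["users_table"],
          documents=["users(name TEXT, age INT, city TEXT)"])
hits = index.query(query_texts=["customers older than 30 by city"],
                   n_results=1)
print(hits["documents"])  # the closest schema summary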
4. LangChain: LangChain is invoked to orchestrate the reasoning and tool usage for schema summarization, matching, and filtering tasks.
5. Schema matcher: Compares the query intent against available database schemas to ensure the selected tables and columns are semantically aligned with the user's request.
6. Checks the global index: The matched schema is cross-referenced with the global index to validate and refine the selection for consistency and relevance across databases.
7. pre_filter(query_embedding): A pre-filtering function is applied to the query embedding to reduce the search space in the vector index, improving retrieval efficiency.
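The exact pre-filter lives in the repository; the sketch below only illustrates the idea with cosine similarity over in-memory candidates and an arbitrary threshold:

import numpy as np

def pre_filter(query_embedding, candidates, min_sim=0.3):
    # candidates: list of (id, embedding) pairs. Keep only entries whose
    # cosine similarity clears a loose threshold, shrinking the search
    # space for the semantic search that follows.
    q = query_embedding / np.linalg.norm(query_embedding)
    kept = []
    for cid, emb in candidates:
        sim = float(np.dot(q, emb / np.linalg.norm(emb)))
        if sim >= min_sim:
            kept.append((cid, sim))
    return sorted(kept, key=lambda pair: -pair[1])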
8. Semantic search on summarized data: Performs a semantic search over summarized schema/data using the pre-filtered embedding, helping select the most relevant content for SQL construction.
9. SQL query generation using LangChain React agent and CoT prompt: Based on retrieved schema and query intent, the LangChain agent generates an initial SQL query using a CoT prompting strategy for clarity and correctness.
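The actual agent wiring is part of the repository; as a rough sketch of the CoT idea alone, a prompt of the following shape can be sent to a local model through ollama.chat (the same call used later in this book), with a hypothetical schema:

import ollama

prompt = (
    "Schema: users(name TEXT, age INT, city TEXT)\n"
    "Question: Which users older than 30 live in Pune?\n"
    "Think step by step: identify the table, the columns to return, "
    "and the filter conditions. Then output only the final SQL query."
)
response = ollama.chat(model="llama3.2:3b-instruct-fp16",
                       messages=[{"role": "user", "content": prompt}])
print(response["message"]["content"])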
10. SQL query generated: The SQL query is produced in structured executable form, with proper clauses (e.g., SELECT, WHERE) reflecting the semantic meaning of the user's query.
11. SQL query is graded: The generated SQL is evaluated by an LLM to verify syntactic correctness and semantic alignment with the original question.
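A simplified stand-in for the repository's grading task might prompt the same local model with a small rubric, as sketched below; the function name, signature, and rubric here are assumptions, not the project's actual grade_sql:

import ollama

def grade_sql_sketch(question, sql_query):
    # Ask the LLM to judge the query; the real task module likely
    # uses a richer rubric and more structured output.
    prompt = (f"Question: {question}\nSQL: {sql_query}\n"
              "Rate this SQL from 1 to 10 for syntactic correctness and "
              "semantic alignment with the question. Give a short reason.")
    reply = ollama.chat(model="llama3.2:3b-instruct-fp16",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]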
12. SQL query execution using LangChain React agent, CoT prompt, and FQE (optional): Executes the final SQL query, optionally using an FQE to aggregate results from multiple databases:
a. Accesses the global index
b. SQL query executed across multiple databases
13. Response sent to LangChain: Execution results are sent back into the LangChain pipeline for post-processing and formatting.
14. Response to user: The final output, including the SQL query, retrieved data, and an optional summary, is sent back to the user via the user interface (UI). All computation is in-memory; no intermediate results are persisted.
To operationalize the agentic text-to-SQL architecture, the system is modularly implemented with a clear separation of concerns across configuration, core logic, UI frontend, and task-specific modules. The following folder structure represents a scalable implementation using LangChain, Ollama, and ChromaDB, enabling both vector-based retrieval and multi-database SQL execution. Core components such as schema matching, SQL generation, summarization, and query grading are abstracted into reusable tasks. The global_index_db/ stores vector indexes, while the frontend handles user interaction. The following structure supports easy extension and robust orchestration of the entire text-to-SQL pipeline, from natural language input to federated query response:
Figure 15.3: Folder structure of agentic text-to-SQL solution
The end-to-end code is available in the GitHub repository.
To enable a user-friendly interface for natural language querying, the system leverages Streamlit for the frontend UI. This allows users to input plain English questions and view results interactively in a web application. For serving backend logic as APIs (when decoupled from the UI), FastAPI and Uvicorn are used to create and host asynchronous endpoints efficiently. These components ensure seamless interaction between the user and the backend processing layers without requiring deep technical expertise.
At the core of the pipeline lies a robust agentic reasoning framework powered by LangChain, which utilizes tools and CoT prompting to deconstruct user queries and generate SQL dynamically. LangChain Community modules enhance this functionality with integrations to tools like ChromaDB and SQL connectors. The Sentence Transformers library generates high-quality embeddings for user queries and documents, which are stored and retrieved using ChromaDB, a high-speed vector database. This embedding-based semantic retrieval ensures schema-aware, contextually accurate results. Additionally, Ollama is used to run local LLMs, such as Llama or Mistral, which perform tasks like query generation, summarization, and output validation.
Finally, Trino serves as the federated SQL query engine that allows seamless execution of SQL across multiple structured data sources (e.g., PostgreSQL, MySQL). This ensures that the system can access and aggregate data from disparate databases in real-time. SQLite3 is used for lightweight local storage of structured datasets, making it ideal for prototyping or small-scale deployment. Combined with requests for API communication and lightweight execution logic, this stack forms a powerful, locally runnable text-to-SQL solution with no cloud or external service dependencies.
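For intuition, a federated query through the Trino Python client might look like the hedged sketch below; hosts, catalogs, and table names are placeholders, not the chapter's configuration:

import trino  # third-party: pip install trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst")
cur = conn.cursor()
# One statement joins tables living in two different catalogs.
cur.execute(
    "SELECT c.name, o.total "
    "FROM postgresql.public.customers c "
    "JOIN mysql.shop.orders o ON c.id = o.customer_id "
    "LIMIT 5")
print(cur.fetchall())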
The following list outlines the setup steps to run this project locally:
1. Clone or extract project: Use the following code to extract and navigate to the project folder:
unzip Chapter_15_Text2SQL-main.zip
cd Text2SQL-main/ollama_pipeline_with_ui
2. Create and activate a virtual environment (recommended): Create and activate a virtual environment to manage dependencies cleanly using the following code:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
3. Install dependencies: Use the provided requirements.txt file to install all necessary dependencies:
pip install -r requirements.txt
Make sure that Ollama is installed and running locally (e.g., ollama run mistral).
4. Seed the database: This script creates or fills a local SQLite database located in the data/ directory:
python seed_sqlite_data.py
This will create or populate a local SQLite database inside the data/ folder.
5. Run the main pipeline (backend only): Start the backend pipeline with the following code:
python main.py
You can edit main.py to invoke run_query("your query here") if needed.
6. Run Streamlit UI (optional): If a Streamlit app is available in frontend/, run the following code:
streamlit run frontend/app.py
This section provides a structured walkthrough of all Python source files in the agentic text-to-SQL system. Each module plays a specific role in transforming natural language queries into structured SQL responses. The architecture is modularized into agents, tasks, core logic, UI, and setup scripts to ensure flexibility, clarity, and reusability. The explanations here are aimed at readers who may be new to agent-based reasoning, vector search, or LangChain-based orchestration.
The following list outlines the layer that orchestrates the system's core logic, coordinating data seeding, query handling, and agent invocation to drive the end-to-end flow:
The following modules implement intelligent agents based on LangChain's ReAct framework that decompose user intent, reason over schemas, and prepare task inputs:
Refer to the following list, which includes the foundational layers powering core services like embeddings, database access, LLM interaction, and utility logic:
The following components encapsulate discrete tasks like SQL generation, grading, summarization, and schema matching, often driven by LLMs:
This module provides a user-friendly graphical interface, enabling interactive natural language querying through a Streamlit app.
frontend/app.py defines the Streamlit-based graphical interface for the system. Users can enter natural language questions, trigger the full pipeline, and view results interactively. This makes the system accessible to non-technical users.
This section contains scripts that initialize vector indexes and prepare the system for semantic retrieval by embedding database schema information into the Chroma vector store:
This modular file design reflects best practices in modern AI system development, separating concerns across user interaction, reasoning, storage, and execution. Each module is designed to be independently testable and extensible, supporting scalable deployment and iterative enhancement.
In the next section, let us understand the inner workings of the code.
In this section, we will understand the internal structure and execution logic of an agentic text-to-SQL system implemented using LangChain, Ollama, ChromaDB, and SQLite. The project is organized into clearly modular components that represent the distinct phases of the pipeline: query understanding, schema summarization, SQL generation, grading, and result aggregation. Designed for extensibility and clarity, the system employs a structured folder hierarchy and an agent-driven orchestration layer to enable seamless translation from natural language queries into executable SQL.
schema_results = summarization_schema_agent.invoke({"input": query})
aggregated_result = aggregate_summarized_data(query)
sql_query = generate_sql(...)
sql_grade = grade_sql(sql_query)
summary_grade = grade_summary(aggregated_result["final_summary"])
This orchestrator calls the summarization agent, aggregates the retrieved information, generates a SQL query, and subsequently grades both the SQL and the aggregated summary. The final output is returned as a dictionary and logged but not persisted to any file or database.
The sibling file agents/sql_agent.py offers an alternative agent that may be used for deeper SQL reasoning. However, this agent is not actively invoked in the main orchestration path.
The functions from these modules are all invoked through the orchestrator in main.py or indirectly via agent tools.
These files operate primarily as backend services that are invoked by the higher-level task and agent modules.
This initialization is vital for enabling semantic search and query execution over distributed datasets.
To complement the modular breakdown of system components, this section presents a chronological overview of the full text-to-SQL pipeline as illustrated in the workflow Figure 15.1. Each step corresponds to a discrete processing stage, from user input and query embedding to SQL generation, grading, execution, and response delivery. The figure provides a visual abstraction of how data and control signals propagate across agents, tools, and databases within the system. This layered orchestration, driven by LangChain and supported by ChromaDB and SQLite, ensures that user queries are interpreted contextually, translated into SQL accurately, and executed efficiently. What follows is a detailed explanation of each numbered stage in the figure.
The primary differentiator of this solution lies in its integration of a LangChain-based ReAct agent with specialized tools for schema understanding and semantic alignment. The summarization_schema_agent intelligently interprets user intent and invokes tools for schema summarization and matching, enabling robust adaptation across varied database structures. These tools ensure that the agent remains context-aware and schema-sensitive, even in heterogeneous environments. This agent-tool synergy not only reduces hallucination in SQL generation but also allows modular plug-in logic for summarization, aggregation, and evaluation, establishing the system's core advantage in enabling precise, explainable, and scalable natural language access to relational databases.
The following pipeline employs a single active LangChain ReAct agent along with a set of registered and standalone tools to perform schema interpretation, SQL generation, and quality assessment:
agents/summarization_schema_agent.py
While there is a second file, agents/sql_agent.py, it is not actively used in the current execution path (main.py). Thus, only one agent is involved in the pipeline.
Additionally, outside the LangChain agent but within the pipeline, main.py directly uses the following task-level tools:
Once the query has been processed through embedding, schema matching, summarization, SQL generation, and grading, the system produces multiple structured outputs. These outputs are generated entirely in-memory and are rendered via the Streamlit interface for the end user. The outputs serve both human interpretability and machine-verifiable evaluation.
The key outputs are as follows:
The final aggregated summary depicted in the following figure indicates that the output presents a comprehensive human-readable synthesis of individuals and database-level records matched semantically to the user query. It highlights individuals with unique entries, cross-database identifiers, and summarized attributes such as Age, City, and ID. Importantly, it resolves entries across databases into unified entities where applicable.
A deeper representation of the underlying data structure, this summary enumerates the records found in each database, highlighting ID ranges and associated individuals. It aids in verifying schema alignment and provides transparency into how data was retrieved and normalized.
The following figure shows the detailed entity and database summary generated by the text-to-SQL system's UI, highlighting unique individuals, potential data duplicates, and their source database records:
The system produces a syntactically correct SQL query that corresponds to the user's intent. Constructed using a CoT prompting template, this query encapsulates the selected table, filtered columns, and conditions. It reflects an interpretable breakdown of reasoning steps used to form the query logic. The following figure displays the generated SQL query breakdown by the text-to-SQL system UI, outlining each step in the query construction, from intent recognition to table/column selection and filter condition formulation, leading to the final executable SQL.
Figure 15.6: Generated SQL query from text-to-SQL system on UI
To validate query quality, the system invokes a grading tool that evaluates correctness, relevance, and execution efficiency. The score is broken into components, each explained, and includes observations about potential ambiguity, indexing efficiency, or logical clarity. This grading supports explainability and query refinement.
The following figure presents the SQL Query Grade interface of the text-to-SQL system, evaluating the generated query across correctness, relevance, and efficiency dimensions, and providing a detailed justification for each score.
The summary produced earlier is also evaluated by the system for accuracy, clarity, and comprehensiveness. The grader identifies possible duplication or schema-level omissions (e.g., entry count ambiguity) and provides suggestions for enhancing textual presentation. This final score ensures the end user receives verifiable insights.
The outputs generated by the agentic text-to-SQL system offer a direct and effective response to the challenges outlined in the initial problem statement. In traditional retail and enterprise data environments, business users often struggle to query large, siloed databases due to a lack of SQL expertise, resulting in delayed insights and missed opportunities. This system addresses that gap by allowing users to interact with distributed, heterogeneous datasets using natural language, while internally orchestrating schema alignment, semantic understanding, SQL generation, and validation.
The final aggregated summary and detailed database output enable business stakeholders to receive clear and human-readable insights drawn from multiple databases without having to understand their structure or write SQL manually. These summaries consolidate relevant data, resolve duplicate entries across databases, and surface actionable patterns (e.g., customer profiles, city-wise distributions) in a format suitable for rapid interpretation and downstream decision-making.
Moreover, the generated SQL query and its grading outputs serve two vital purposes: first, they transparently show how the system translates a user's intent into structured database queries; second, they provide verifiable quality assessments on correctness, relevance, and efficiency, instilling trust in the automated process. The summary grade further ensures that the textual output meets standards of clarity, completeness, and factual accuracy, making the solution suitable for reporting and business use.
Collectively, these capabilities transform a manual, error-prone querying process into a fully automated, explainable, and scalable pipeline, empowering non-technical teams to access insights across databases in real-time and act decisively based on context-rich, validated information.
This system is not intended to replace data engineers but rather to augment their capabilities and reduce the operational bottlenecks in querying enterprise data. By automating routine and repetitive SQL generation tasks, the platform empowers business users to retrieve insights independently, allowing data engineers to focus on higher-order activities such as data modeling, pipeline optimization, and governance. The solution democratizes access to structured data without compromising schema fidelity, execution correctness, or system auditability. In doing so, it enhances productivity across roles while preserving the critical responsibilities and oversight provided by technical data teams.
This chapter has provided a comprehensive walkthrough of the inner workings of an agentic text-to-SQL system, highlighting the design, logic, and output structure that underpin its functionality. Beginning with the orchestration logic in main.py, we examined how the system sequentially invokes schema summarization, aggregation, SQL generation, and quality grading using a modular, tool-driven agent pipeline. The integration of a single ReAct-style LangChain agent, equipped with schema-aware tools, forms the backbone of intelligent query interpretation and response generation.
The task-specific modules ensure clear separation of responsibilities, with each component, such as the summarizer, SQL generator, and grader, performing distinct, verifiable roles. Core infrastructure modules provide support for vector-based retrieval, LLM interaction, and multi-database SQL execution. The use of ChromaDB and SQLite in tandem enables scalable and semantically enriched querying across structured data sources.
The system's outputs, including final summaries, graded SQL, and interpretability-focused feedback, demonstrate its usability for both technical and non-technical stakeholders. By leveraging agentic planning, CoT prompting, and local LLMs, the architecture balances transparency, adaptability, and performance. In doing so, it represents a pragmatic blueprint for deploying text-to-SQL systems in enterprise environments where precision, schema alignment, and real-time feedback are essential.
In the next chapter, we will discuss integration of optical character recognition (OCR) with generative AI (GenAI) to build intelligent pipelines that convert images into actionable search insights.
In this chapter, we will explore the integration of optical character recognition (OCR) with generative AI (GenAI) to build intelligent pipelines that convert images into actionable search insights. The goal is to extract meaningful textual information from images, such as product photos, advertisements, or catalog screenshots, and use that information to guide decision-making, product discovery, or search redirection.
We begin by leveraging EasyOCR, a Python-based OCR library that provides high-accuracy text detection in images. Once text is extracted, it is passed to a lightweight large language model (LLM) hosted locally via Ollama, to generate a natural language search query. This query reflects how a human might search for similar or better alternatives on popular shopping platforms like Amazon, Flipkart, or eBay.
The pipeline then performs Uniform Resource Locator (URL) redirection to simulate searches on these platforms or fetches partial page content using lightweight scraping. The extracted snippets are summarized again using the LLM to provide users with a quick comparative overview, showcasing similar products, offers, or pricing trends.
The architecture (illustrated in Figure 16.1) is modular, interpretable, and deployable locally, making it ideal for building GenAI shopping assistants or visual product comparison tools.
In this chapter, we will learn about the following topics:
The objective of this chapter is to equip readers with the knowledge and practical skills to perform OCR using advanced multimodal techniques. Readers will learn how to extract text from images and Portable Document Format (PDF) files using foundation models capable of interpreting both visual and textual data. The chapter introduces the Mistral OCR application programming interface (API) for document understanding and highlights its integration into intelligent pipelines. Special emphasis is placed on extracting structured information from receipts with tabular data, enabling downstream analysis. By the end of the chapter, readers will be able to build robust OCR systems for diverse real-world formats and layouts.
In the context of building intelligent systems that process and understand visual input, OCR remains a foundational capability. As the demand for seamless interpretation of image-based text grows, so does the evolution of techniques to perform OCR using traditional machine learning (ML), transformer-based language models, and multimodal reasoning systems. The following section introduces and contrasts three distinct approaches to OCR in a GenAI context: wrapping standalone OCR engines within GenAI workflows, using LLMs natively trained to perform OCR, and employing multimodal LLMs capable of direct image-to-text comprehension.
Note: Mistral OCR is a dedicated OCR foundation model, not just a utility wrapper or plug-in. It is designed to be a powerful base model purpose-built for OCR and complex document understanding tasks. Key details include the following:
Each of these approaches reflects a different point on the spectrum of modularity, generalization, and system complexity. The approach chosen will depend on constraints such as latency, infrastructure, model availability, and interpretability requirements. In this chapter, we focus on the first method, wrapping EasyOCR within a GenAI pipeline, due to its simplicity, effectiveness, and suitability for locally-deployed intelligent agents.
The following figure illustrates three distinct GenAI-based OCR integration strategies, ranging from standalone OCR foundation models, to modular pipelines that wrap traditional OCR engines behind APIs, to fine-tuned multimodal LLMs that unify OCR and comprehension:
In an era dominated by e-commerce and digital marketplaces, consumers are often faced with an overwhelming number of product choices, each accompanied by varying specifications, brands, and price points. While online platforms provide rich search interfaces, users frequently rely on images, screenshots from friends, photographs of store displays, or social media posts to express their intent, as explained in Figure 16.2. For many, the traditional approach of manually searching for each product detail is cumbersome and inefficient:
Consider a user who captures a screenshot of a headphone advertisement showing a brand, technical specs, and a discount. The user wants to know whether better alternatives are available within the same price range from other trusted brands like Sony or JBL. However, the text in the image cannot be copied, and searching manually is time-consuming. This is where an intelligent visual assistant becomes invaluable.
In this use case, we introduce a pipeline that combines OCR with a GenAI-powered system to automate the entire discovery process. The pipeline begins by extracting relevant product information from the image using OCR. This could include product names, specifications (e.g., 3.5mm jack, mic support, length of cable), pricing, and discounts. The extracted text is then passed to a local LLM (via Ollama), which generates a natural language query that mimics how a real user might search for alternatives online.
Rather than just displaying the raw OCR text, the system simulates search results on multiple e-commerce platforms such as Amazon, Flipkart, and eBay. It fetches these search results or snippets of product information and summarizes them using the same LLM. The end user is then presented with a concise and contextual comparison of alternatives available in the market, without needing to open multiple websites or conduct manual research.
This approach significantly improves the shopping experience for users who prefer visual input, are on a budget, or are looking for smarter alternatives without investing time in repetitive searches. It is especially beneficial for price-sensitive markets and mobile-first users who often use screenshots and social media as their primary mode of capturing product interest.
Ultimately, this use case demonstrates how combining OCR with GenAI-enabled systems bridges the gap between unstructured visual input and structured, actionable insight, paving the way for intelligent, multimodal consumer tools.
OCR using LLMs transforms traditional text extraction from images into a semantically rich understanding task. Unlike conventional OCR, which only transcribes visible characters, LLM-based OCR can interpret layout, infer structure, and contextualize extracted content. This approach allows for intelligent extraction of headings, tables, labels, and relationships across the document. When combined with the GenAI pipeline shown in Figure 16.1, the system can return markdown or structured outputs and even answer questions about the content. This unlocks powerful capabilities for automating workflows in document analysis, digital archiving, compliance, and visual data-driven decision-making.
Figure 16.3 showcases a product listing for the Storm Wired Headphone, presenting a rich example of multimodal data where visual, textual, and semantic elements are intertwined. It contains a product photo, descriptive metadata (e.g., technical specs, user ratings, and pricing), and contextual cues such as popularity and discount details. For an OCR-enabled GenAI system, this image is not just about extracting text; it is about understanding product relevance, parsing hierarchical attributes (e.g., brand, features, price, offer), and mapping them to actionable outputs like search queries or structured records. Such multimodal inputs are ideal for pipelines that combine vision-language models (VLMs) and intelligent text extraction to enable smarter shopping assistants or product recommendation engines.
Let us understand the folder structure of this project. The following figure outlines the modular structure of an OCR-enabled GenAI pipeline designed for visual product discovery. The workflow begins with an input image placed in the assets/ folder, which is processed using image_utils.py to extract text via EasyOCR. This raw text is converted into a search-friendly query using a local LLM via search_utils.py. Search URLs are generated by web_scraper.py and used to fetch real-time product snippets. These snippets are summarized using summarizer.py, again leveraging an LLM. The entire pipeline is orchestrated through main.py, offering a fully local and interpretable image-to-insight system.
This system follows a modular architecture for processing image-based inputs and turning them into actionable shopping intelligence. The pipeline is designed to accept a product-related image, such as a photo of a retail box, a screenshot from a chat, or a promotional banner placed into the assets/ folder. From there, the image is analyzed using an OCR tool (EasyOCR) to extract visible text. The resulting raw text is passed into a local LLM via Ollama, which generates a realistic search query a user might type on Flipkart or Amazon. That query is used to construct real e-commerce search URLs. Finally, product listings from these URLs are scraped and summarized to provide an overview of similar or better alternatives. The pipeline runs entirely locally, making it useful for privacy-preserving or offline scenarios.
The architecture is modular and composed of five key components, which are as follows:
The end-to-end code can be found in the GitHub repository.
The pipeline depends on a few critical Python libraries, as outlined in requirements.txt. First, easyocr (along with torch and torchvision) powers the text extraction from images. pillow supports image loading and preprocessing, if needed. The ollama package is the interface to locally hosted LLMs like Llama 3, enabling you to generate search queries and summaries without relying on cloud APIs. Web access is managed by requests and beautifulsoup4, which are used for lightweight scraping of product listings. Optional packages like selenium and google-search-results are listed but not actively used in this version, providing room for future expansion with dynamic scraping or SerpAPI-based Google search integration. Overall, the requirements are minimal and keep the system portable and offline-friendly, as shown in the following figure:
Figure 16.5: Snapshot of requirements.txt, which can be installed before running the code
The following section explains the overall flow of the solution. It begins by extracting text from an image using OCR, converts that text into a natural language search query via an LLM, fetches product listings from e-commerce platforms, and finally summarizes the results using another LLM call. The design emphasizes clarity, traceability, and graceful error handling, making it robust for real-world use.
1. OCR with EasyOCR (extracting text from images): The first major step in the pipeline involves using EasyOCR to read and extract textual information from an image. This logic is implemented in image_utils.py, where a pre-trained OCR model is initialized with English language support and configured for CPU-based inference. The core function extract_text_from_image(image_path) reads the image and returns a joined string of recognized words. For example, if the image says "boAt Wired Earphones ₹499", the OCR engine will return that as plain text. This step is crucial because it translates unstructured visual data into a structured format that downstream components (like the LLM) can understand and reason over.
a. The image is processed using EasyOCR:
reader = easyocr.Reader(['en'], gpu=False)
results = reader.readtext(image_path, detail=0)
This extracts plain text strings from an image. For example, an image of a product box might return "JBL Wired Headphones with Mic ₹799".
2. Query generation via LLM (converting text into intent): Once the text has been extracted from the image, it is passed to the local LLM hosted via Ollama. The function generate_search_query(ocr_text) in search_utils.py constructs a prompt asking the model to convert the OCR text into a realistic, user-friendly search phrase, something you would type into Flipkart to discover similar or better products. For example, if the extracted text is "boAt 3.5mm Wired Headphones ₹799", the LLM might return "wired headphones with a mic under 800". This step bridges the gap between raw image content and search-ready intent. It is a simple but powerful example of how LLMs can interpret ambiguous input and contextualize it for specific tasks.
a. The extracted text is passed into a prompt for the LLM:
prompt = f"The following product text was extracted from an image:\n\n{ocr_text}..."
response = ollama.chat(model="llama3.2:3b-instruct-fp16", messages=[{"role": "user", "content": prompt}])
b. The LLM returns a simplified search phrase like:
wired headphones with mic under 800
3. URL construction (redirect-only shopping links): Instead of performing API-based product retrieval, the web_scraper.py module builds direct search URLs for major platforms like Amazon, Flipkart, and eBay. This is achieved through simple string encoding using urllib.parse.quote_plus and dynamic URL templating. The function get_product_listings(query) takes the generated search query and inserts it into the appropriate search URL structure for each platform. For example, a query like "wireless earbuds under 1000" will become https://www.amazon.in/s?k=wireless+earbuds+under+1000. This design choice allows the pipeline to be API-independent, robust to platform changes, and fast to deploy.
a. Instead of calling APIs, your system constructs direct search URLs:
i. f"https://www.amazon.in/s?k={encoded_query}"
ii. f"https://www.flipkart.com/search?q={encoded_query}"
iii. f"https://www.ebay.com/sch/i.html?_nkw={encoded_query}"
This allows redirection to real product listings based on the generated query.
4. Web snippet extraction and summarization: With the search URLs ready, the system fetches the HTML content of each product listing page using requests. In summarizer.py, the function fetch_page_snippet(url) scans the page and collects readable product-related snippets (e.g., titles, descriptions, prices) from common HTML tags like <a>, <div>, and <span>. These snippets are then summarized by the LLM using a second prompt that asks the model to extract themes, keywords, and pricing patterns. The function summarize_product_pages(product_listings) loops over all search results, fetches snippets from each, and returns a set of human-readable summaries, one for each store. This step elevates the user experience by providing a synthesized overview rather than dumping raw text.
a. Basic text snippets are fetched from each site using requests + BeautifulSoup:
soup = BeautifulSoup(response.text, 'html.parser')
for tag in soup.find_all(['a', 'div', 'span'], limit=100):
    snippets.append(tag.get_text(strip=True))
b. Then, the snippets are summarized by the LLM:
prompt = f"The following are some product listings from {site_name}:\n\n{joined_text}"
response = ollama.chat(...)
c. This returns summaries like: "Common listings include boAt, JBL, and Sony under ₹1,000 with mic and tangle-free cables."
5. Full pipeline orchestration via main.py: The main.py script acts as the entry point and orchestrator for the entire system. It first scans the assets/ folder to find the first available image. This image is processed by extract_text_from_image(), the resulting text is transformed into a search query by generate_search_query(), and then the query is passed into get_product_listings() to generate shopping links. Finally, summarize_product_pages() is called to fetch, parse, and summarize the product data. Logging is used throughout the script to track progress and errors, making the system easy to debug and maintain. When executed, the script prints out both the raw listings and LLM-generated summaries, offering the user insight into what similar products are available online.
The following Python script defines a modular pipeline that automates product search and summarization based on visual input:
1. 导入和设置:加载所有模块化组件,如 OCR、查询生成、网络抓取和摘要工具:
1. Imports and setup: It loads all modular components like OCR, query generation, web scraping, and summarization utilities:
导入操作系统
import os
导入日志
import logging
from image_utils import extract_text_from_image
from image_utils import extract_text_from_image
from search_utils import generate_search_query
from search_utils import generate_search_query
from web_scraper import get_product_listings
from web_scraper import get_product_listings
from summarizer import summarize_product_pages
from summarizer import summarize_product_pages
本节导入必要的模块和函数。每个模块负责一项特定的任务:
This section imports the necessary modules and functions. Each module is responsible for a specific task:
a. image_utils.py :包含 OCR 逻辑。
a. image_utils.py: It contains OCR logic.
b. search_utils.py :它包含基于 LLM 的查询生成。
b. search_utils.py: It contains LLM-based query generation.
c. web_scraper.py :它构建电子商务搜索网址。
c. web_scraper.py: It constructs e-commerce search URLs.
d. summarizer.py :它从这些 URL 获取内容并对其进行总结。
d. summarizer.py: It fetches content from those URLs and summarizes it.
该系统采用模块化设计,便于维护和扩展。
The system is built in a modular fashion for easy maintenance and scalability.
2. Logging configuration: It sets up formatted logging to aid debugging and monitor execution flow with time-stamped messages:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
Logging is configured to output informational messages and errors in a time-stamped, structured format. This is useful for debugging, especially if OCR fails, the image path is incorrect, or the web response fails.
3. Image finder utility: It locates the first valid image in the assets/ directory to serve as the OCR input source:
def find_first_image_in_assets():
    assets_folder = "assets"
    if not os.path.exists(assets_folder):
        raise FileNotFoundError(f"Assets folder '{assets_folder}' not found")
    for file in os.listdir(assets_folder):
        if file.lower().endswith(('.jpg', '.jpeg', '.png', '.webp')):
            return os.path.join(assets_folder, file)
    raise FileNotFoundError("No image found in the assets folder.")
This helper function looks inside the assets/ directory and returns the first valid image file it finds. If the folder does not exist or contains no supported images, it raises an error. This ensures that the pipeline always has a visual input to begin with.
4. Main pipeline execution: It begins the end-to-end flow by invoking the image finder and initiating subsequent processing steps:
def main():
    try:
        image_path = find_first_image_in_assets()
This starts the workflow by calling the image-finding utility. The file path is stored in image_path, which will be used in the OCR step.
5. OCR text extraction: Uses EasyOCR to extract text from the located image; fails gracefully if no text is found:
extracted_text = extract_text_from_image(image_path)
if not extracted_text:
    raise ValueError("No text could be extracted from the image")
logging.info(f"OCR Extracted Text:\n{extracted_text}")
Here, EasyOCR reads the image and returns a string of recognized text. If no text is detected (empty string), an exception is raised. The result is logged for traceability.
6. Search query generation via LLM: Converts extracted text into a clean, user-like search query using a local language model and sanitizes it:
query = generate_search_query(extracted_text)
if not query:
    raise ValueError("Failed to generate a valid search query")
query = query.replace('"', '').replace("₹", "rs").replace("or less", "under").replace("alternative", "")
logging.info(f"Search Query:\n{query}")
a. The raw OCR text is now passed to a local LLM (via Ollama), which transforms it into a user-like search query such as:
"wired headphones under rs 800"
Some basic sanitization is done to clean special characters and standardize currency symbols (₹ → rs) for compatibility with search URLs.
7. Redirect URL construction: Builds e-commerce search result URLs from the generated query for platforms like Amazon and Flipkart:
results = get_product_listings(query)
if not results:
    logging.warning("No product listings found")
    return
This query is used to build search URLs for Amazon, Flipkart, and eBay by calling get_product_listings(query). These are not actual API calls but direct redirect URLs to the respective websites. If none are returned (which should not happen), a warning is logged.
8. Print URLs to console: Displays a concise list of found products, including name, price, store, and a clickable link:
print("\nHere are some similar or better alternatives you can check out:\n")
for i, res in enumerate(results, 1):
    print(f"{i}. {res['name']}")
    print(f"   Price: {res['price']}")
    print(f"   Store: {res['merchant']}")
    print(f"   Link: {res['link']}\n")
The generated search listings are printed in a clean format. Since these are redirect URLs (not full product listings), each entry just shows:
a. Store name
b. Link to the search result page
9. Summarize web snippets: It fetches and summarizes content from the search result pages using the LLM to highlight trends and insights:
print("\nSummary of Product Listings:\n")
summaries = summarize_product_pages(results)
for s in summaries:
    print(f"{s['store']} Summary:\n{s['summary']}\n")
This is where the second LLM call happens. For each search URL, the program does the following:
a. Fetches the web page using requests.
b. Extracts visible product text from the HTML.
c. Summarizes the overall trend using the LLM (e.g., top brands, typical price ranges, common features).
d. This provides a readable overview of what is trending in each store's results based on your query.
10. Robust error handling: Captures and logs file, value, and unexpected errors to ensure graceful failure and meaningful logs:
except FileNotFoundError as e:
    logging.error(f"File error: {str(e)}")
except ValueError as e:
    logging.error(f"Value error: {str(e)}")
except Exception as e:
    logging.error(f"An unexpected error occurred: {str(e)}")
a. Three types of exceptions that are caught are as follows:
i. Missing folder or image.
ii. Empty OCR or invalid LLM output.
iii. Any other unexpected errors.
This ensures the pipeline fails gracefully and outputs helpful logs.
b. Execution trigger:
if __name__ == "__main__":
    main()
This is the script's execution entry point. It ensures the main() function only runs when the script is directly executed, not when imported as a module.
This output represents the final result of a complete OCR-to-LLM pipeline, where a user-provided image of a product is used to generate intelligent e-commerce alternatives. The image in question shows a wired headphone advertisement with visible price and features. The following figure provides a breakdown of the system's behavior and its resulting output.
Starting with a product image containing details of a wired headphone, the system accurately extracts descriptive text using EasyOCR. This raw text is then transformed by a local LLM into a realistic and goal-oriented search query, specifically looking for alternatives from well-known brands like Sony and JBL within a set price range. Using this query, the system constructs direct search URLs for Amazon, Flipkart, and eBay, mimicking how a human might explore online stores. It then retrieves visible text snippets from those search pages and summarizes the results using the same LLM. In the case of Flipkart, the model identifies that the page content lacks substantive product information and instead focuses on promotional language and urgency cues, such as limited-time offers and fast delivery messaging. This response not only highlights the model’s ability to extract and interpret data but also to assess the quality and relevance of content across different platforms, ultimately empowering users to make informed shopping decisions based on visual inputs alone.
OCR on multimodal documents like PDFs involves extracting and interpreting a mix of textual, visual, and structural content within a single file. Unlike plain images, PDFs often include typed text, scanned pages, tables, images, and layout elements such as headers, footers, and multi-column sections. Advanced OCR systems powered by VLMs or foundation models like Mistral OCR can process these documents holistically, identifying reading order, extracting tables and figures, preserving formatting, and capturing semantic meaning. When integrated with schema-based extraction or document question and answer (QA) capabilities, this enables automated understanding of contracts, invoices, reports, or academic papers, making PDF-based workflows intelligent, searchable, and machine-actionable.
The following image-text snippet illustrates how visual (chart + table) and textual data together convey the progression of educational attainment from 1996 to 2022:
Figure 16.7: An example of a multimodal document suitable for OCR processing
Figure 16.7 exemplifies a multimodal data source that combines structured visuals (bar charts and tables) with unstructured textual descriptions, representing a rich and complex input ideal for OCR-driven document understanding. In the context of multimodal AI systems, such images require not only text extraction but also layout interpretation, semantic alignment between graphical and textual elements, and contextual reasoning. Leveraging advanced OCR techniques powered by LLMs enables accurate transcription, structure preservation, and meaningful interpretation, transforming static visual content into actionable, machine-readable insights. This forms a critical foundation for intelligent analysis of reports, educational trends, and policy documents.
Mistral’s OCR stack transforms traditional OCR from passive transcription into an active, structured, and interactive system. It offers layered capabilities: layout-preserving text extraction, schema-based data capture, and context-aware querying via LLMs. These functions (basic OCR, annotations, and document QA) enable developers to build sophisticated document intelligence applications, from extracting tables and captioning figures to creating chat-like assistants. Together, they form a versatile foundation for real-world GenAI pipelines, as shown in the following code, supporting rich multimodal workflows across diverse document types:
# Step 1: Install the Mistral client
!pip install mistralai --quiet

# Step 2: Import required modules
import os
from mistralai.client import Mistral
from mistralai.models.chat_completion import ChatMessage

# Step 3: Set API key (ensure your key is securely set)
os.environ["MISTRAL_API_KEY"] = "your_mistral_api_key"  # Replace with your actual key
api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)  # Instantiate the API client used for uploads and chat

# Upload and get document URL
with open("/content/sample_data/educational_attainment_figure.pdf", "rb") as f:
    uploaded_file = client.files.upload(
        file={"file_name": "educational_attainment_figure.pdf", "content": f},
        purpose="ocr"
    )
signed_url = client.files.get_signed_url(file_id=uploaded_file.id)

# Ask a question
messages = [
    ChatMessage(
        role="user",
        content=[
            {"type": "text", "text": "Summarize post-school education growth over the years."},
            {"type": "document_url", "document_url": signed_url.url}
        ]
    )
]
response = client.chat.complete(
    model="mistral-small-latest",
    messages=messages
)
print(response.choices[0].message.content)
The purpose of the following regex code is to automatically detect and extract URLs from a user's input message, specifically targeting document links such as PDFs.
Mistral's document QA feature allows you to attach document URLs alongside your text prompt. But users might type something like: Can you summarize this paper? https://arxiv.org/pdf/2410.07073
To make this work, the system needs to detect any URLs embedded in the user's message and attach each one as a document_url entry alongside the text:
import re  # required here to support the regex-based URL extraction

def extract_urls(text: str) -> list:
    url_pattern = r'\b((?:https?|ftp)://(?:www\.)?[^\s/$.?#].[^\s]*)\b'
    urls = re.findall(url_pattern, text)
    return urls
user_message_content = [{"type": "text", "text": user_input}]
for url in document_urls:
    user_message_content.append({"type": "document_url", "document_url": url})
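For instance, the two fragments combine as in this short, hypothetical end-to-end snippet:
user_input = "Can you summarize this paper? https://arxiv.org/pdf/2410.07073"
document_urls = extract_urls(user_input)  # ['https://arxiv.org/pdf/2410.07073']
user_message_content = [{"type": "text", "text": user_input}]
for url in document_urls:
    user_message_content.append({"type": "document_url", "document_url": url})
messages = [{"role": "user", "content": user_message_content}]  # ready for client.chat.complete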
The following figure represents a typical example of a semi-structured receipt commonly found in datasets like the Consolidated Receipt Dataset (CORD). These receipts contain rich textual information, including itemized product listings, quantities, unit prices, tax calculations, and total payment summaries, all formatted in visually complex layouts. Extracting structured information from such documents is a foundational task in modern document understanding research. This image serves as a real-world benchmark to evaluate OCR and document parsing systems, particularly for key-value extraction and table detection using foundation models like Mistral OCR or multimodal models such as Llama 3.2 vision via Ollama.
Figure 16.8: A receipt that consists of textual tabular data
The following Python code demonstrates the way to perform image-based document understanding using Meta's Llama 3.2 vision model via the Ollama runtime. This approach integrates computer vision and natural language understanding by allowing a user to upload an image and query it in natural language, with the large multimodal model producing a structured response. The code is designed for use in environments like Google Colab, where the image file is stored in a default data directory.
The core logic of the pipeline involves invoking the ollama.chat() method, where the model parameter is set to llama3.2-vision, indicating that a vision-enabled Llama 3.2 instance is being used. The prompt get all the data from the image is sent as the message content under the user role, and the image itself is passed in a list under the images key. Once the LLM processes the image, it returns a structured textual response within the message['content'] field of the response object. The strip() function ensures that any leading or trailing whitespace is removed from the response before displaying it. The model output in this case includes detailed invoice metadata such as company name, address, billing recipient, and line-item entries, showcasing the model’s ability to parse layout-rich documents like invoices. This example illustrates a significant advancement over traditional OCR by capturing not just text but also context, relationships, and hierarchies directly from image input, thus facilitating more intelligent document automation use cases.
import ollama
image_path = "/content/sample_data/invoice_sample.jpg"  # Replace with your image path
This line sets the path of the image to be processed. In a typical Colab setup, this would be /content/sample_data/invoice_sample.jpg.
response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "get all the data from the image",
        "images": [image_path]
    }],
)
This block uses Ollama’s API to interact with the Llama 3.2 vision model. The model processes the image and returns a textual breakdown of its content, ideally a structured summary of the receipt.
cleaned_text = response['message']['content'].strip()
The response is cleaned of whitespace to prepare it for further processing.
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
This portion sets up the LangChain ecosystem for prompt chaining. The user defines a template prompt instructing the model to extract and return specific fields (e.g., company name, receipt number, item list, total).
llm = ChatOllama(model="llama3", temperature=0)
This initializes an LLM connection to the standard Llama 3 model using Ollama (a text-only model this time).
chain = (prompt | llm | StrOutputParser())
return chain.invoke({"response": cleaned_text})
Here, the prompt is chained with the model and parser, then executed with the cleaned OCR text. The output is expected to be JSON-formatted.
json_match = re.search(r"```\n(.*?)\n```", result, re.DOTALL)
This searches for a JSON block enclosed in triple backticks ``` inside the model response.
parsed_data = json.loads(receipt_data)
Once extracted, the JSON string is parsed into a Python dictionary using json.loads.
receipt_dict = json.loads(json_data)
items_df = pd.DataFrame(receipt_dict['Items'])
The receipt dictionary is further processed by converting the Items list into a Pandas DataFrame, which enables further operations like data analysis, aggregation, or visualization.
This code exemplifies a multimodal RAG-like system, combining image understanding (Llama 3.2 vision), prompt-based semantic extraction (LangChain), and structured output (JSON/DataFrame). It is a compelling example of how foundational models can bridge unstructured visual inputs and structured analytics in an automated, end-to-end pipeline.
As an extension to this chapter, readers are encouraged to explore real-world OCR challenges using CORD, a publicly available dataset curated for information extraction from store receipts. This dataset consists of image-PDFs and corresponding JSON annotations, making it an ideal candidate for testing document understanding systems on semi-structured financial documents. Readers can experiment with extracting merchant names, itemized purchases, totals, and tax values, either by training their own token classifiers or using layout-aware prompting strategies. The key task is to go beyond raw text extraction and develop end-to-end pipelines that understand document semantics and formatting.
For implementation, readers may choose one of two cutting-edge approaches. First, they can leverage Mistral’s document QA API, which automatically applies OCR and allows for structured QA using document URLs. This approach is scalable and requires minimal setup. Alternatively, readers can experiment with Meta’s Llama 3.2 vision model using the Ollama runtime, which supports multimodal image inputs. In this setup, receipts can be passed as images to the model with tailored prompts (e.g., list all the items and their prices from this receipt), enabling visual-semantic reasoning. This task encourages students to combine dataset engineering, prompt design, and multimodal LLMs to create robust, high-accuracy document understanding systems.
In this chapter, we explored three distinct yet complementary approaches to performing OCR in the context of multimodal data. First, we demonstrated how traditional OCR tools like EasyOCR can be wrapped within a GenAI pipeline to extract and reason over text from images, enabling intelligent interpretation of unstructured visual inputs. Second, we introduced Mistral OCR, a foundation model natively trained for document understanding, which streamlines OCR on complex PDFs by providing structured outputs through API-driven document QA. Lastly, we examined the power of multimodal LLMs, such as Meta’s Llama vision series, in handling receipt images with embedded tabular data, highlighting their ability to simultaneously interpret layout, extract content, and generate semantically structured outputs. Together, these methods provide a robust toolkit for building next-generation OCR systems that bridge the gap between raw visual input and actionable structured understanding.
In the next chapter, we will focus on wrapping traditional models with GenAI, e.g., recommendation engines.
Join our Discord space
Join our Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
As the boundaries between traditional machine learning (ML) and generative AI (GenAI) continue to blur, there is increasing value in creating hybrid systems that combine the strengths of both. In this chapter, we explore how to wrap and integrate conventional ML models, such as classifiers, regressors, and clustering algorithms, into GenAI agent workflows. By making these models callable as tools within agentic reasoning loops, we unlock powerful capabilities where generative agents can not only converse and generate but also predict, classify, and recommend with precision.
Using technologies like scikit-learn, LangChain, and lightweight Python microservices, you will learn how to expose ML models via application programming interfaces (APIs) and make them interact seamlessly with GenAI agents. We will walk through practical implementations, including a recommendation engine integrated as a callable tool within a large language model (LLM) reasoning chain. Along the way, we will address key operational challenges such as API latency, error handling, and versioning of models, ensuring robustness and reliability in production-ready systems.
By the end of this chapter, you will have built a fully functioning hybrid system where GenAI agents dynamically invoke ML predictions as part of their chain of thought (CoT). This fusion of reasoning and prediction paves the way for intelligent systems that are not only conversationally fluent but also analytically powerful.
In this chapter, we will learn about the following topics:
The objective of this chapter is to guide readers through the end-to-end development of a hybrid AI system that integrates traditional ML with modern GenAI. Specifically, it demonstrates how to train, deploy, and wrap an Extreme Gradient Boosting (XGBoost) fraud detection model as an API, and then use an LLM like Mistral to interface with it via natural language. Readers will learn how to extract structured features from text, call ML tools programmatically, interpret model outputs, and generate actionable explanations, all within a modular, production-ready architecture. The goal is to make traditional ML models accessible, explainable, and usable through GenAI agents.
Company X Analytics is a 15-member AI startup specializing in intelligent retail solutions for mid-sized e-commerce platforms. Over the past two years, the team developed a suite of traditional ML models, including a collaborative-filtering-based recommendation engine, a churn prediction model using XGBoost, and a product categorization model trained on custom logistic regression classifiers. These models had been manually integrated into dashboards or batch jobs, but lacked real-time, interactive utility.
As GenAI gained momentum, the company subscribed to commercial LLMs like OpenAI's Generative Pre-trained Transformer (GPT) and Anthropic’s Claude, intending to build a conversational assistant that could help retail managers make smarter, faster decisions. However, the challenge was clear: how to bridge the intelligence embedded in their traditional ML models with the reasoning and language fluency of LLMs.
The following case study illustrates how the startup bridged traditional ML and GenAI by wrapping legacy models into a GenAI workflow. By combining LangChain agents, RESTful ML microservices, and CoT prompting, the company built an intelligent system capable of answering complex business questions with both reasoning and predictive accuracy.
These queries required both natural language understanding and direct access to existing ML insights, something that LLMs could not do out of the box.
The team then defined LangChain tool objects that mapped directly to these APIs. With CoT prompting, the LLM was instructed to invoke the right tool based on the user's intent. For example, if a manager asked, what are some high-risk customers this month?, the agent would parse the input, call the churn prediction API, and return actionable insights in fluent language.
While this case study is fictional, it reflects a real scenario faced by many modern enterprises. Today, organizations are heavily investing in LLMs through subscriptions or API integrations, while simultaneously sitting on a rich legacy of traditional AI and ML models, ranging from recommendation engines to risk scoring systems. Instead of fine-tuning large, costly LLMs or rebuilding existing solutions from scratch, companies can adopt this hybrid approach to maximize value. By wrapping their traditional models as callable tools within LLM-powered agents, they can create intelligent systems that combine domain-specific insights with the natural language reasoning of GenAI, accelerating innovation while preserving past investments.
As the field of AI transitions into the era of GenAI, the challenge and opportunity lie in bridging traditional AI models with the emergent capabilities of LLMs. Enterprises often possess a portfolio of pre-existing ML and deep learning models designed for specific predictive or perceptual tasks. Rather than discarding or fine-tuning LLMs for these use cases, a more modular and cost-effective approach involves wrapping traditional models as callable tools and orchestrating them via LLM-based agents. This enables intelligent systems where LLMs serve as the reasoning layer, while traditional models perform high-accuracy predictive tasks.
This section explores how various traditional AI/ML models, from classifiers and regressors to convolutional neural networks (CNNs) and optical character recognition (OCR), can be seamlessly integrated into GenAI workflows using tool-augmented agents. By exposing models via APIs and enabling LLMs to interpret and act on their outputs, developers can build intelligent systems that combine predictive accuracy with natural language interaction, details as follows:
In Chapter 16, GenAI for Extracting Text from Images, we explored how to integrate OCR capabilities with GenAI by wrapping OCR models as callable tools.
In a hybrid GenAI/ML system, traditional ML processes can be triggered either interactively via user instructions to the LLM or automatically through batch workflows orchestrated by the agent. When users engage directly, they issue natural language queries such as can you predict the churn risk for this customer? or extract text from this receipt and summarize the key details. The LLM interprets the intent, structures the required inputs, and calls the corresponding ML tool, such as a churn prediction model or an OCR service, via predefined API wrappers.
Alternatively, in batch or background processing, the LLM agent may iterate over a queue of tasks (e.g., daily image folders, transaction logs) and autonomously invoke traditional models. For instance, a scheduled agent may analyze all uploaded invoices every night using OCR and pass the extracted data to a financial anomaly detector. These operations are initialized through LangChain-like orchestration layers or microservice pipelines that monitor triggers or workflows and coordinate tool invocations accordingly.
This design supports both on-demand reasoning and automated ML execution, allowing organizations to combine GenAI's flexibility with the precision of legacy models, enabling applications like fraud detection, recommendation engines, and customer analytics with minimal manual intervention.
This case study examines hybrid ensemble learning for telecom fraud detection. Fraud detection remains a critical challenge in the telecommunications industry, where fraudulent activities, such as identity spoofing, Subscriber Identity Module (SIM) cloning, and illegitimate claim submissions, pose substantial risks to revenue and customer trust. The rarity of fraudulent instances compared to legitimate transactions results in highly imbalanced datasets, making conventional classification methods inadequate. The study presents an ensemble-based ML approach, augmented by deep learning methods, to detect fraud in a real-world telecom claims dataset characterized by a 6:94 fraud-to-non-fraud ratio.
The dataset comprised anonymized claim-related features, including customer metadata, claim submission intervals, and verification flags. Notably, variables such as IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM, and TOTAL_VERIFICATIONS carried domain-specific semantic value and were not imputed using statistical means. Instead, such features were encoded using flag-based approaches to preserve interpretability. Categorical features were label-encoded, and numerical attributes were standardized using Z-score normalization. Missing values and zero-inflated features were visualized and handled explicitly to ensure robust downstream model behavior.
An initial XGBoost model was developed, incorporating the scale_pos_weight parameter to address class imbalance. Instead of relying on the default decision threshold of 0.5, a threshold tuning mechanism was applied. Precision, recall, and F1 scores were computed across multiple thresholds, and the optimal cutoff was selected to maximize the F1 score, achieving a trade-off between fraud detection (recall) and false alarm reduction (precision).
Performance was evaluated using standard classification metrics, including the confusion matrix, precision-recall (PR) curve, receiver operating characteristic (ROC) curve, Matthews correlation coefficient (MCC), and Cohen’s Kappa score. This multi-metric evaluation provided a comprehensive view of model reliability under imbalance conditions.
To further improve generalization and model robustness, a stacked ensemble classifier was constructed. The base learners included XGBoost, LightGBM, and the gradient boosting classifier. Their individual probability outputs were passed to a meta-classifier, logistic regression, which learned to optimally combine their outputs. The ensemble was trained using a stratified train-test split and evaluated on the same metrics as the baseline model.
The stacked ensemble demonstrated superior performance compared to any single model. It yielded higher recall for fraud detection while maintaining competitive precision, thus minimizing both false negatives and false positives. The ROC-AUC and PR-AUC scores improved notably, and the MCC and Kappa values confirmed increased model stability.
The study underscores the efficacy of combining tree-based classifiers in a stacked architecture for fraud detection in highly imbalanced datasets. Moreover, threshold optimization and domain-informed preprocessing were essential for improving real-world applicability. The proposed approach can be integrated into production systems for fraud risk scoring and supports extensibility for SHAP-based interpretability or real-time fraud monitoring APIs. If we want to use the above ensemble or XGBoost model (as described in the fraud detection case study) in conjunction with an LLM, the LLM serves as a reasoning, orchestration, and explanation layer around the already-trained predictive model.
The following is a breakdown of the roles and purposes the LLM would serve in a hybrid system:
Tool(
    name="FraudScoringTool",
    func=call_xgboost_api,
    description="Predicts fraud probability for a telecom claim."
)
The LLM calls this tool internally when reasoning about fraud.
This is akin to CoT reasoning, augmented by model outputs.
The existing XGBoost pipeline performs highly refined fraud classification. The code can be found in the GitHub repository, featuring threshold tuning, feature selection, and visualization. To augment this with an LLM, we introduce an intelligent reasoning layer. First, the LLM acts as a natural language interface, allowing users to ask, is this claim likely to be fraudulent? The LLM parses user queries, extracts structured features (e.g., IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM), and invokes the XGBoost model via an API wrapper or LangChain tool.
The following architecture illustrates an integrated system that combines a traditional fraud detection pipeline using an XGBoost model with a modern GenAI-based chat interface. The upper section outlines the data ingestion process, where transaction and demographic data are preprocessed using pandas and subsequently used to train an XGBoost model via scikit-learn. This trained fraud detection model is exposed through a FastAPI interface. In the lower section, a user interacts with a LangChain-powered conversational agent that leverages the fraud model as a tool. The agent performs reasoning with the support of an Ollama-hosted LLM (e.g., Mistral) to generate contextual responses.
Figure 17.1: Architecture diagram of swapping a traditional model with GenAI
Next, the LLM performs orchestration, determining when to trigger predictions, re-run threshold tuning, or generate SHapley Additive exPlanations (SHAP) values. For example, if a user asks why was this claim flagged?, the LLM interprets the model output and can request the feature importance plot or call a SHAP explainer module.
The LLM provides explanation, converting numerical predictions and thresholds into human-readable reasoning:
This claim has a 91% fraud probability based on rapid submission and missing mobile details. It crosses the optimal F1 threshold of 0.55.
Thus, the LLM transforms a technical ML pipeline into an accessible, explainable, and interactive fraud detection system usable by analysts and decision-makers without direct coding expertise.
The requirements.txt file, as shown in Figure 17.2, specifies all necessary dependencies for building, training, serving, and orchestrating the hybrid LLM-XGBoost fraud detection system. It includes core ML libraries such as XGBoost, scikit-learn, and Pandas for model development and preprocessing, as well as FastAPI and Uvicorn for RESTful API serving. Dependencies like LangChain and Ollama enable natural language tool-based reasoning through a local LLM backend. This unified specification ensures that the project can be set up consistently across environments and supports reproducible experimentation, LLM-driven inference workflows, and scalable production deployment with minimal configuration overhead. Run pip install -r requirements.txt to install all dependencies.
Figure 17.2: Requirements and dependencies for the hybrid project
To set up and run the complete pipeline, refer to the following steps in order, starting from model training to launching the API and executing the GenAI agent:
1. Train the model: python model/train_xgb_model.py
2. Start the FastAPI server: uvicorn api.fraud_model_api:app --reload --port 8000
3. Run the LLM agent with LangChain + Ollama: python agent/run_agent.py
To ensure modularity, maintainability, and ease of deployment, this project adopts a clean, layered folder structure that separates model training, API serving, LLM orchestration, and utility logic, as shown in the following figure. Each component, like the data preprocessing, XGBoost modeling, FastAPI integration, and LangChain tool wrapping, is isolated in its own directory, promoting scalability and clarity. The model folder contains all artifacts necessary for downstream inference, while API exposes these assets as REST endpoints. The tools and agent layers enable natural language interaction with structured ML predictions via Ollama-powered reasoning agents. This structure supports both iterative development and seamless transition to production-grade systems.
To save your trained XGBoost model, along with other necessary components like selected features, scaler, and label encoders, you can use joblib (recommended for large models due to better performance over pickle).
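For example, assuming the training script holds these objects in memory, the save step could be as simple as the following; the filenames match the artifacts described later in this section:
import joblib

joblib.dump(model, "model/xgb_model_final.pkl")
joblib.dump(selected_features, "model/selected_features.pkl")
joblib.dump(scaler, "model/scaler.pkl")
joblib.dump(label_encoders, "model/label_encoders.pkl")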
The following figure outlines the end-to-end workflow for integrating a traditional XGBoost model into a GenAI agent using FastAPI, LangChain, and Ollama:
Figure 17.4: End-to-end pipeline for wrapping an XGBoost model into a GenAI agent
This implementation presents a modular hybrid system where a traditional XGBoost classifier is exposed through a FastAPI service and orchestrated by a GenAI agent using LangChain. The use case is based on telecom fraud detection. The system highlights how existing ML pipelines can be integrated into agentic workflows for enhanced interpretability and usability.
The ML backend is built using an XGBoost classifier. The script train_xgb_model.py performs the following sequential steps:
1. Data preparation: The dataset is loaded from data/dummy_test_vif_filtered_imputed_cleaned.csv, and categorical features are label-encoded while numerical features are standardized using StandardScaler.
2. Feature selection: Recursive feature elimination (RFE) selects the top 10 most predictive features.
3. Model training: A class-weighted XGBoost model is trained using these features to handle imbalanced fraud data.
4. Model evaluation: Performance metrics such as precision, recall, F1 score, MCC, ROC, and PR curves are plotted. Threshold tuning is performed to identify the optimal decision boundary.
5. Model saving: The trained model and its associated preprocessing objects (scaler, label_encoders, selected_features) are saved using joblib into the model/ directory.
Figure 17.5 shows that the training pipeline completed successfully, producing a high-performing XGBoost model for fraud detection. After encoding and scaling, the model underwent RFE to retain the most informative predictors. A threshold tuning phase revealed that a decision threshold of 0.70 maximized the F1 score. At this threshold, the classifier achieved an overall accuracy of 89%, with a precision of 0.25 and a recall of 0.44 for the fraud class. Evaluation metrics such as MCC (0.270) and Cohen’s Kappa (0.257) indicate moderate agreement, confirming the model’s effectiveness in handling class imbalance while minimizing false positives and false negatives.
The training process also generates the following files under the model/ directory:
The xgb_model_final.pkl file contains the trained XGBoost classifier optimized for fraud detection. It is the core predictive engine used by the API and the GenAI agent. The selected_features.pkl file stores the top 10 features identified through RFE, ensuring only the most relevant inputs are used during inference. The scaler.pkl file holds a StandardScaler object used to normalize numerical input features for consistency with training.
Lastly, label_encoders.pkl contains LabelEncoder objects for transforming categorical input features into numerical form, preserving the encoding logic used during model training for reliable real-time predictions.
The trained XGBoost model is served via FastAPI in fraud_model_api.py. Key components include the following:
As shown in the following figure, this layer also includes cross-origin resource sharing (CORS) middleware to facilitate future frontend integrations:
Figure 17.6: The reloader process [22380] started using WatchFiles
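A condensed, hypothetical sketch of fraud_model_api.py illustrates this serving layer; the /predict route name is an assumption, the artifact filenames match those produced by training, and the sketch assumes the scaler was fit on the selected features:
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import joblib
import pandas as pd

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"],
                   allow_methods=["*"], allow_headers=["*"])

# Load the trained artifacts saved by train_xgb_model.py
model = joblib.load("model/xgb_model_final.pkl")
scaler = joblib.load("model/scaler.pkl")
selected_features = joblib.load("model/selected_features.pkl")

@app.post("/predict")
def predict(features: dict):
    # Order the incoming features, scale them, and score with XGBoost
    X = pd.DataFrame([features])[selected_features]
    proba = float(model.predict_proba(scaler.transform(X))[0][1])
    return {"fraud_probability": proba}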
The fraud_tool.py file defines a utility function call_fraud_model(features: dict), which serves as a tool wrapper:
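A minimal version of that wrapper might look like this; the endpoint path and port follow the uvicorn command from the setup steps, and both are assumptions about the actual API:
import requests

def call_fraud_model(features: dict) -> dict:
    # POST structured claim features to the FastAPI service and
    # return its JSON fraud prediction
    response = requests.post("http://localhost:8000/predict", json=features, timeout=10)
    response.raise_for_status()
    return response.json()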
The following figure shows FastAPI running on port 8080:
Figure 17.7: The figure shows that the FastAPI is up and running on port 8080
In langchain_fraud_tool.py, the preceding wrapper is exposed as a LangChain-compatible tool:
from langchain_core.tools import Tool
from tools.fraud_tool import call_fraud_model

fraud_detection_tool = Tool(
    name="FraudDetectionTool",
    func=call_fraud_model,
    description="Use this to check if a telecom claim is likely fraudulent. Provide structured features like IS_MISSING_MOBILE, HOUR_TO_RAISE_CLAIM, and TOTAL_VERIFICATIONS."
)
This tool enables the GenAI agent to invoke the model as part of its decision-making process.
The script run_agent.py implements a LangChain agent that does the following:
Critically, handle_parsing_errors=True is used to allow the agent to recover from ambiguous LLM output, ensuring robustness during reasoning cycles:
from langchain.agents import initialize_agent, AgentType
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral")  # assumption: a local Mistral model served by Ollama

agent = initialize_agent(
    tools=[fraud_detection_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True
)
The response is printed to the terminal, showing the interpreted output and fraud prediction:
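A hypothetical invocation of the agent, with an illustrative claim description, might look as follows:
query = ("A claim was raised 2 hours after activation, the mobile number is "
         "missing, and there were 0 verifications. Is it likely fraudulent?")
result = agent.invoke({"input": query})
print(result["output"])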
After receiving a user instruction, the LLM interprets the natural language query to identify that a fraud check is requested. It extracts relevant features from the query, formats them into a structured JSON payload, and invokes a tool that sends this input to the FastAPI service hosting the XGBoost model. Upon receiving the fraud probability score, the LLM interprets the result and generates a human-readable explanation based on known feature importances (e.g., missing mobile or odd submission hours). Finally, it returns a clear, conversational response to the user, optionally suggesting next actions like review or rejection.
The end-to-end code can be found in the GitHub repository.
In a hybrid GenAI/ML system, integrating traditional models like CNNs, segmentation models, ANNs, and OCR differs in terms of data types, model architecture, deployment complexity, and interaction with LLMs. The following is a comparative overview of how their implementation and integration may differ:
| Model type | Use case | Implementation | Serving strategy | LLM integration |
| --- | --- | --- | --- | --- |
| ANNs | Structured/tabular tasks (e.g., churn prediction, risk scoring) | Preprocessing of numerical and categorical features (scaling, encoding); resembles XGBoost pipelines | Wrapped as APIs that take vectors and return predictions | LLM sends input vector → receives prediction → explains result in natural language |
| CNNs | Image-based tasks (e.g., classification, defect detection) | Image preprocessing (resizing, normalization); trained on labeled image datasets | Served via REST APIs accepting image files (base64 or URLs); returns labels/probabilities | LLM encodes user query into image upload + metadata → invokes CNN → interprets result (e.g., defect detected) |
| Segmentation models | Pixel-wise classification (e.g., medical imaging, satellite data) | Outputs segmentation masks; often requires GPU-backed serving | Served via TorchServe/TF Serving with GPU; returns overlays/masks | LLM sends image + context → receives mask → explains segmented regions (e.g., tumor boundary) |
| OCR | Text extraction from images (e.g., receipts, documents) | Uses tools like Tesseract or EasyOCR to extract unstructured text | Served as a tool/API returning raw text from image input | LLM combines OCR output with semantic reasoning (e.g., what is the invoice amount?) |
Table 17.1: Comparative overview of ML model integration in GenAI workflows
As a practical extension of this chapter, your task is to build a LangChain agent that interfaces with a graph-based recommendation model, commonly used in scenarios like product recommendation, social network suggestions, or content discovery. Begin by selecting or implementing a recommendation model that uses graph data structures, such as node embeddings from Node2Vec, Personalized PageRank, or a graph neural network (GNN). The model should expose a function or API endpoint that accepts a user ID or item ID and returns a ranked list of recommended nodes.
Next, wrap this function or API into a LangChain tool by defining a description, expected input schema, and output behavior. Then, use a local LLM (e.g., Mistral via Ollama) to create a LangChain agent that can interpret natural language instructions like "suggest products similar to this item" or "what should user 123 watch next?" The LLM should parse the intent, extract the user or item ID, call the graph recommendation tool, and explain the output in plain English. This task reinforces key concepts from the chapter (tool wrapping, agent orchestration, and reasoning) while applying them to a new but complementary domain of graph-based AI systems; a starting sketch follows.
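As a starting point, here is a minimal sketch of the tool-wrapping step using Personalized PageRank over a toy NetworkX graph; the graph contents, tool name, and ranking logic are illustrative assumptions:

import networkx as nx
from langchain_core.tools import Tool

# Toy interaction graph; in practice, build this from real user-item data.
G = nx.Graph()
G.add_edges_from([
    ("user_123", "item_A"), ("user_123", "item_B"),
    ("user_456", "item_B"), ("user_456", "item_C"),
])

def recommend(node_id: str) -> str:
    """Rank the other nodes by Personalized PageRank relative to node_id."""
    scores = nx.pagerank(G, personalization={node_id: 1.0})
    ranked = sorted((n for n in scores if n != node_id),
                    key=scores.get, reverse=True)
    return ", ".join(ranked[:3])

graph_recommendation_tool = Tool(
    name="GraphRecommendationTool",
    func=recommend,
    description="Given a user ID or item ID, returns a ranked list of recommended nodes."
)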
In this chapter, we demonstrated how to build a hybrid AI system that combines the predictive power of a traditional XGBoost model with the reasoning and language capabilities of an LLM like Mistral via Ollama. We began by implementing a robust fraud detection pipeline using XGBoost, incorporating class imbalance handling, feature selection, threshold tuning, and performance evaluation. The trained model and preprocessing components were saved using joblib for downstream inference.
Next, we deployed the model as a REST API using FastAPI, enabling real-time predictions. We then constructed a LangChain-compatible tool that calls this API and wrapped it into a reasoning agent powered by a locally hosted LLM. This agent receives natural language queries, extracts structured features, invokes the XGBoost model, interprets the result using precomputed feature importances, and delivers human-readable explanations and recommendations.
We also defined a clear project folder structure, a complete requirements.txt for reproducibility, and a process flowchart. The result is a modular, explainable, and scalable AI system where traditional ML and GenAI collaborate to provide intelligent fraud detection and decision support in real-world applications.
In the next chapter, we will cover LLM operations (LLMOps) and GenAI evaluation techniques.
Join our Discord space
Join our Discord workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
This is the final chapter, arriving after we have implemented and understood numerous generative AI (GenAI) systems across diverse domains, from retrieval-augmented generation (RAG) and agent orchestration to multimodal pipelines and optimization frameworks. In this concluding chapter, we shift our focus to the operational and evaluative backbone that makes these intelligent systems reliable, scalable, and production-ready. This chapter delves into large language model operations (LLMOps) and RAGOps, a critical set of practices, tools, and design principles for managing the lifecycle of LLM-based applications in real-world settings. You will explore topics such as deployment, monitoring, observability, versioning, and adaptive feedback loops for RAG pipelines, as well as strategies to ensure resilience, traceability, and governance in LLM-driven products.
Alongside operational excellence, we turn to GenAI evaluation techniques, which are essential for measuring quality, relevance, accuracy, and user alignment. Traditional metrics often fall short in capturing the nuanced performance of generative systems, so we introduce both automatic and human-in-the-loop (HITL) evaluation strategies. This includes scoring mechanisms as well as modern alignment metrics, hallucination detection, and model-grounded evaluation frameworks.
Together, these operational and evaluative foundations enable you to confidently move from experimentation to enterprise-grade deployment in the era of GenAI.
In this chapter, we will learn about the following topics:
- Significance of Ops in production-grade GenAI applications
- LLM evaluation versus RAG evaluation
- RAGOps during development and post-deployment
- Core observability platforms for RAG systems
- Graph-enhanced RAG-based recommendation system with MLflow observability
The objective of this chapter is to introduce and conceptualize RAGOps, a structured approach to operationalizing RAG systems in real-world GenAI deployments. It explores the significance of Ops in production-grade GenAI applications, differentiates between LLM and RAG evaluation methodologies, and emphasizes how both support continuous system observability and reliability. Readers will understand how to implement RAGOps during development and post-deployment phases, utilize core observability platforms, and apply these concepts in a graph-enhanced RAG-based recommendation system. A practical to-do exercise guides readers through end-to-end implementation, reinforcing the need for traceability, monitoring, and evaluation in scalable GenAI systems.
Consider a real-world application (use case illustration): a personalized content recommendation system for a streaming platform that uses a graph-based model enriched with LLM-generated summaries and RAG-based user query interpretation. Initially, the system works well in the lab, returning relevant content when tested with controlled inputs. However, once deployed in production, several challenges emerge, and this is where Ops (LLMOps and RAGOps) becomes critical.
For instance, as user traffic increases, the model's latency grows due to long retrieval chains or application programming interface (API) rate limits. Without proper monitoring, this slowdown could go unnoticed, degrading the user experience (UX). Ops practices allow you to set up latency and throughput monitoring, alerting the engineering team to anomalies before users are affected.
Additionally, user behavior may shift over time; new genres, slang, or trending topics could reduce the relevance of the pre-trained embeddings or graph connections. Without adaptive retraining pipelines or feedback-aware indexing, the recommendations will become stale. RAGOps ensures the vector store and knowledge base are refreshed regularly, either through scheduled updates or real-time feedback loops.
Now, imagine a sudden spike in hallucinations, where the LLM generates inaccurate or irrelevant summaries. With a robust evaluation and logging system in place, Ops helps identify when model responses deviate from expected behavior and triggers a fallback mechanism or flags outputs for review.
Moreover, versioning and rollback mechanisms are essential. If a new model or graph update causes quality to drop, Ops allows teams to quickly revert to a stable version without disrupting the entire system.
Another critical aspect of RAGOps is tracking and managing embeddings. In production systems, embeddings represent not just static content, but the evolving knowledge base of your application. Changes to documents, user preferences, or even updates in the LLM can affect embedding quality and relevance. Without embedding version control and logging, it is difficult to trace why retrievals are failing or why the LLM is generating off-topic responses. Ops practices enable embedding metadata logging, timestamping, and collection versioning, allowing teams to audit which embeddings were used for a specific query, when they were generated, and whether they align with the latest content. This traceability is essential for debugging, compliance, and continual improvement of the retrieval layer.
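A minimal sketch of what such an embedding metadata record might contain; the field names and model name are illustrative, not a prescribed schema:

import hashlib
import json
from datetime import datetime, timezone

def embedding_metadata(doc_id: str, text: str, model: str, collection_version: str) -> dict:
    """Build a metadata record stored alongside each embedding for later audits."""
    return {
        "doc_id": doc_id,
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "embedding_model": model,
        "collection_version": collection_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(embedding_metadata("doc-42", "sample passage", "all-MiniLM-L6-v2", "v3"), indent=2))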
So, without LLMOps and RAGOps, even the most innovative GenAI applications risk failure at scale. Ops ensures reliability, observability, governance, and continuous improvement, transforming a prototype into a trustworthy, production-grade solution that consistently delivers value.
When developing a RAG system, it is essential to understand the distinction between LLM evaluation and RAG evaluation, as each targets different components of the overall pipeline. While both aim to assess quality, relevance, and performance, they focus on different stages of the system and require different techniques and metrics.
LLM evaluation refers to assessing the language model's ability to generate accurate, fluent, and contextually appropriate responses, given an input prompt. This evaluation is typically performed during model selection, fine-tuning, or validation stages and focuses on dimensions such as accuracy, fluency, coherence, and alignment with the prompt.
Common evaluation methods include automated, reference-based metrics such as BLEU, ROUGE, and BERTScore.
Other methods include human evaluation for quality ratings and prompt-based unit tests to assess reasoning, summarization, or hallucination tendencies.
LLM evaluation is crucial for understanding how the model performs in isolation, without the external retrieved context.
RAG evaluation, by contrast, focuses on the complete retrieval + generation pipeline, measuring how effectively the system retrieves relevant documents and uses them to generate grounded, context-aware answers. It involves several layers, spanning retrieval quality, generation groundedness, and end-to-end answer quality.
RAG evaluation also emphasizes logging and traceability, tracking which documents were retrieved, which embedding version was used, and how prompts were formed. This enables root cause analysis of system failures and continual improvement.
While LLM evaluation helps you judge the standalone capabilities of your model, RAG evaluation assesses how well the entire pipeline performs in practice. A model might generate perfect answers in isolation but fail when paired with poor retrieval. Conversely, great retrieval with a misaligned LLM could lead to hallucinations. Therefore, both types of evaluation must be conducted, independently and jointly, to ensure a reliable, production-grade RAG system.
In production-grade GenAI systems, especially RAG architectures, evaluation is not just a development activity; it is a critical component of Ops (LLMOps and RAGOps). These evaluations serve as the foundation for monitoring quality, diagnosing failures, maintaining system integrity, and enabling continuous improvement.
In production, user expectations are high; every response must be relevant, fluent, and grounded. Evaluations enable you to quantify quality using both automated metrics (like BLEU, ROUGE, and BERTScore) and HITL feedback systems. These evaluations are essential for establishing quality baselines, defining acceptable performance thresholds, and detecting degradation over time.
For example, if BERTScore or METEOR drops in real-time A/B tests after a model update, Ops teams can trigger rollbacks or route traffic to a more stable version. This continuous evaluation loop ensures that model updates do not silently degrade UX.
LLMOps must account for concept drift, where the model's performance decays due to changing user behavior, vocabulary, or context. Evaluations help detect this drift early. For instance, a rise in hallucination rates, measured through faithfulness metrics or LLM-based verifiers, can indicate that retrieved documents are outdated, irrelevant, or misaligned with user queries.
By continuously evaluating generation groundedness, RAGOps systems can track when outputs deviate from retrieved documents and trigger automatic index refresh, embedding re-generation, or retraining schedules.
RAG systems rely heavily on vector stores and knowledge bases. Poor retrieval quality is often the root cause of bad outputs, even if the LLM is functioning correctly. Evaluation metrics like Recall@k, Embedding Similarity Score, and Coverage Score provide real-time insight into retrieval effectiveness.
Operational dashboards that visualize these metrics allow teams to identify low-recall queries, irrelevant document hits, or cold-start issues with new content. Such evaluations enable retriever tuning, prompt engineering adjustments, or embedding index regeneration without needing to retrain the LLM.
Both LLM and RAG evaluations support Ops-level traceability. In complex GenAI systems, being able to track which version of a retriever, embedding model, or LLM produced a specific answer is critical for compliance, audits, and debugging. Evaluation logs act as structured evidence that a given pipeline version met required performance standards before deployment.
These evaluations can also be used in continuous integration and continuous deployment (CI/CD) pipelines, where any drop in test-time BLEU, ROUGE, or answer F1 blocks production deployment until the issue is resolved.
Advanced GenAI Ops incorporates feedback-aware retraining and Reinforcement Learning from Human Feedback (RLHF). Evaluation metrics provide the signal for these feedback loops, enabling the system to learn from user ratings, click-throughs, or corrections.
For instance, if a user rates an answer poorly, evaluations can compare it against the retrieved documents and flag whether it is a retrieval issue or a generation issue. This targeted insight supports fine-grained optimization, not just generic retraining.
In GenAI Ops, evaluation metrics are observability tools; they expose system behavior in real-time, detect faults, guide rollbacks, inform retraining, and enable intelligent automation. Without robust LLM and RAG evaluation, Ops teams are effectively blind, reacting to user complaints instead of proactively ensuring system reliability and trustworthiness.
RAGOps refers to the operational practices, tools, and monitoring strategies applied to RAG systems across their lifecycle. It encompasses the evaluation, tracking, and optimization of key components such as embedding generation, document retrieval, reranking, prompt construction, and language model outputs. RAGOps ensures that systems remain accurate, grounded, and performant in both development and production environments. By integrating observability, versioning, feedback loops, and automated evaluation, RAGOps enables teams to detect drift, reduce hallucinations, and maintain alignment with user intent. Ultimately, RAGOps is essential for building scalable, trustworthy, and continuously improving GenAI applications based on retrieval-enhanced architectures.
During both the development and post-development phases of a RAG system, RAGOps plays a vital role in ensuring quality, reliability, and traceability. During development, it enables systematic evaluation of embeddings, retrieval accuracy, prompt construction, and generation groundedness through metrics and observability tools. Post-development, RAGOps shifts focus to monitoring, drift detection, real-time failure tracking, and feedback integration in production environments. By applying RAGOps practices throughout the lifecycle, teams can proactively address issues, enforce quality benchmarks, and support continuous improvement, transforming RAG systems from experimental prototypes into scalable, dependable solutions ready for real-world deployment.
RAGOps during development centers on two complementary activities, identification and benchmarking, whose objectives and practices are described below.
Identifying and benchmarking RAGOps during development is inherently complex due to the multi-component, non-deterministic, and modular nature of RAG systems. Achieving robust observability and evaluation during development requires a structured approach.
Identification is the foundational phase of RAGOps. It focuses on discovering the key points in the RAG pipeline where failures may occur and establishing what exactly needs to be tracked to ensure quality, reliability, and traceability.
To begin with, the RAG system must be decomposed into its constituent components: embedding creation, retrieval, optional reranking, prompt construction, and generation. For each of these stages, developers must identify what could go wrong and how those failures would manifest.
For instance, during embedding creation, low-quality vector representations can lead to poor retrieval results. Therefore, it is important to monitor embedding drift, ensure complete document coverage, and track the timeliness of index updates. Developers should also examine whether embeddings reflect the current state of the content or if stale vectors are in use.
The retrieval stage often fails due to semantic mismatch or inadequate top-k ranking. Identifying issues here requires tracking how often retrieved documents are relevant to user queries. This involves examining the overlap between retrieved documents and ground truth or expectations.
In systems with reranking layers, additional complexity is introduced. Failures may include misranking of truly relevant documents or instability across repeated runs. This requires tracking how much reranking changes the retrieval order and whether it improves downstream generation.
Prompt construction is another sensitive step, where malformed or overlong prompts can lead to truncated or misaligned inputs to the language model. Identifying such issues requires monitoring prompt template consistency, token length, and formatting errors.
Finally, in the generation phase, hallucinations and incoherent outputs are the most common issues. Developers must identify whether the model faithfully uses the retrieved content and avoids producing fabricated information. This entails inspecting the alignment between the generated output and the source documents.
Identification is, therefore, a diagnostic process. It sets the stage for observability by exposing which parts of the RAG pipeline are fragile and which metrics or signals are indicative of those fragilities.
Once critical tracking points are identified, the next step is benchmarking: defining quantitative baselines and quality standards for each component.
Benchmarking begins with the creation of a gold-standard evaluation dataset. Since live user data is unavailable or unrepresentative during early development, this dataset is typically composed of manually curated or synthetically generated queries, each paired with known relevant documents and expected outputs. This controlled setup allows the system's performance to be measured in a consistent, repeatable manner.
Next, for each stage of the RAG pipeline, developers must define the appropriate evaluation metrics. For example, the retrieval stage is evaluated using metrics such as Recall@k and precision@k to assess how many relevant documents are successfully retrieved, while generation is assessed for semantic correctness and groundedness using metrics such as BERTScore, faithfulness score, and hallucination rate.
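For concreteness, here is a minimal sketch of how Recall@k and precision@k can be computed over ranked retrieval results; the document IDs are placeholders:

def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of relevant documents that appear among the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    relevant_set = set(relevant)
    return sum(1 for d in retrieved[:k] if d in relevant_set) / k

retrieved = ["d1", "d4", "d2", "d9"]
relevant = ["d2", "d7"]
print(recall_at_k(retrieved, relevant, 3), precision_at_k(retrieved, relevant, 3))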
Once the system is tested against the benchmark dataset, the resulting scores are recorded as baseline values. These scores serve as reference points, allowing future iterations of the pipeline to be compared for regression or improvement. Benchmarking is not just about collecting numbers but about establishing what is acceptable. This involves defining tolerance thresholds, for example, requiring that retrieval recall does not fall below a certain value or that hallucination rates remain under a defined maximum.
Crucially, benchmarking must also be tied to version control. Each benchmark result must be associated with a specific version of the embedding model, vector index, prompt template, or reranker. This ensures that observed changes in performance can be traced back to specific modifications in the pipeline.
Benchmarking concludes not merely when baseline scores have been recorded, but when those scores are integrated into the development workflow. This can take the form of manual checklists during testing or automated gates in a CI/CD pipeline. The objective is to ensure that every component of the RAG pipeline adheres to minimum performance standards before being considered ready for production.
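A minimal sketch of such an automated gate; the metric names and thresholds are illustrative assumptions, not prescribed values:

# Baselines recorded during benchmarking; illustrative values only.
BASELINE = {"min_recall_at_5": 0.80, "max_hallucination_rate": 0.05}

def benchmark_gate(current: dict) -> bool:
    """Return True if the candidate pipeline meets the recorded baselines."""
    ok = True
    if current["recall_at_5"] < BASELINE["min_recall_at_5"]:
        print("Gate failed: retrieval recall regressed below baseline")
        ok = False
    if current["hallucination_rate"] > BASELINE["max_hallucination_rate"]:
        print("Gate failed: hallucination rate exceeds tolerance")
        ok = False
    return ok

# In CI, a failed gate would block deployment, e.g., by exiting non-zero.
assert benchmark_gate({"recall_at_5": 0.83, "hallucination_rate": 0.04})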
Identification provides the observability structure (what to watch and where issues might arise), while benchmarking sets the quantitative reference (how good the system must be to meet operational standards). Together, they form the core of RAGOps during development, ensuring that the system is robust, interpretable, and ready to evolve under operational constraints. Now, let us focus on the post-development phase.
The following Table 18.1, RAGOps tracking during development, outlines what to track, where failures may arise, and which metrics or tools can be used at each stage of the RAG development pipeline. This serves as a practical guide to integrate observability and evaluation into your development workflow before deployment.
| Stage | Key failure points | What to track | Metrics/Tools |
| --- | --- | --- | --- |
| Data ingestion and embedding creation | Low-quality or outdated embeddings, missing documents, and format issues | Embedding quality, document count, format validity, and embedding drift | Embedding similarity, coverage %, LangChain logs, Facebook AI Similarity Search (Faiss) index stats |
| Retrieval | Irrelevant top-k results, retrieval latency, poor recall | Recall@k, precision@k, query latency, retrieved document overlap | Recall@k, query time, Langfuse/Arize Phoenix logs |
| Reranking (if used) | Incorrect ranking, noisy scoring, context mismatch | Score divergence, top-1 relevance, and rank stability | Reranker score variance, correlation metrics, and evaluation traces |
| Prompt construction | Overlong prompts, incorrect formatting, token cutoff | Prompt length, prompt-template consistency, truncation rate | Token length logs, prompt-template versions |
| Generation | Hallucinations, incoherence, context ignoring | Groundedness, fluency, hallucination rate, LLM logs | BERTScore, hallucination checker, WhyLabs, Langfuse traces |
| Evaluation | Subjective quality issues, no feedback loop | Human ratings, groundedness, output faithfulness | BLEU, ROUGE, BERTScore, Ragas, human annotation logs |

Table 18.1: RAGOps tracking during development
Post-development RAGOps applies the same two activities continuously in production. In practice, you first identify what to track, and then you set benchmarks based on those tracked metrics. The sequence is as follows:
1. Identify what to track (i.e., define key metrics aligned with your system's goals).
2. Establish benchmarks (i.e., establish baseline values and acceptable thresholds for those metrics).
Table 18.2, RAG system failure tracking table, presents a structured summary of failure points, identification strategies, metrics, and tracking methods across different RAG system types. You can use it as a diagnostic and operational reference during system evaluation and deployment.
| RAG system | Key failure points | Identification strategy | Tracking method | Metrics |
| --- | --- | --- | --- | --- |
| Single-stage RAG | Low-quality embeddings, irrelevant retrieval, hallucinated generation | Recall@k, BERTScore, hallucination analysis | Embedding logs, retrieval overlap, grounding checks | Recall@k, precision@k, BERTScore, hallucination rate |
| Two-stage RAG | Weak initial retrieval, poor reranking, context mismatch | First vs. reranked recall, reranker score analysis | Intermediate document logs, reranker metadata | Recall@k (pre/post rerank), reranker score distribution, faithfulness |
| Multi-stage RAG | Error propagation, excessive filtering, reranker conflict | Stage-wise ablation, ensemble disagreement | Stage logs, reranker versioning | Stage-wise recall, ensemble agreement score, and context utilization |
| Multimodal RAG | Modality misalignment, poor fusion, ungrounded outputs | Cross-modal similarity, attention map analysis | Modality-specific logs, fusion trace, and drift monitoring | CLIP similarity, VQAScore, cross-modal BERTScore, image caption BLEU |
| Traditional tool in RAG | Tool misuse, misinterpretation, and API failure | Action-observation mismatch, schema validation | Tool call logs, prompt versioning | Tool invocation accuracy, schema match rate, tool error rate |
| Agentic RAG | Planning loops, invalid toolchains, and goal misalignment | Trace coherence, chain validity checks | Full trace logs, tool error tracking | Agent plan validity, action-observation alignment, step accuracy |
| Graph-based RAG | Sparse/irrelevant graph, traversal errors | Graph metrics, node relevance scoring | Traversal logs, edge weight tracking | Graph coverage, node centrality, edge relevance score |
| Text-to-SQL RAG | Wrong schema, invalid SQL, execution failure | SQL syntax validation, execution testing | Schema logs, query result comparison | SQL validity rate, execution accuracy, schema alignment score |
| OCR-based RAG | OCR inaccuracy, layout misclassification | OCR confidence, text-visual comparison | OCR logs, retrieval accuracy audits | OCR confidence score, text extraction accuracy, and retrieval precision |

Table 18.2: RAG system failure tracking table
Following the successful development and initial deployment of a RAG system, the focus shifts toward maintaining reliability, quality, and operational continuity in a live environment. While real-time monitoring, logging, and user feedback play an important role in production, benchmarking remains a fundamental post-development practice within the broader RAGOps framework. Post-development benchmarking ensures that system behavior remains aligned with its original objectives, detects silent regressions, and supports traceable quality assurance.
This distinction is critical. In a dynamic production environment, where query distributions evolve, indices are updated, and external systems fluctuate, benchmarking offers a stable, invariant reference against which systemic changes can be measured. Without this control, teams are left to interpret model performance through noisy, unlabeled, and ever-changing live data, making it difficult to isolate the causes of performance degradation.
Gold-standard data enables objective regression detection, longitudinal performance comparison, and traceable quality assurance. These datasets are frozen in structure, allowing performance to be tracked longitudinally. Moreover, organizations can incrementally augment them with high-quality, human-verified examples derived from live user data, thus enabling a hybrid benchmarking approach that evolves with production needs without sacrificing reliability.
In production workflows, this often takes the form of scheduled benchmark evaluations (e.g., nightly runs or pre-deployment checks in a CI/CD pipeline). Performance metrics are computed on the benchmark dataset, compared to historical baselines, and flagged if they fall outside acceptable tolerances.
This is especially critical when diagnosing performance drops, as benchmarks provide the only invariant baseline in a system exposed to real-time user and data variability.
Benchmarking during the post-development phase is not only feasible; it is a critical pillar of RAGOps. It provides the objectivity, stability, and interpretability needed to monitor system health in complex and volatile production settings. By combining static, gold-standard datasets with live system insights, organizations can ensure that RAG systems remain reliable, explainable, and aligned with operational goals. Benchmarking thus acts as a continuous validation mechanism.
While benchmarking provides a stable and repeatable framework for validating RAG system performance against known standards, it is inherently static and periodic. In contrast, continuous monitoring operates in real-time, enabling system stakeholders to observe, evaluate, and respond to performance variations as they unfold in production environments. Continuous monitoring is an essential component of post-development RAGOps, facilitating operational reliability, user trust, system resilience, and feedback-driven improvement.
RAG systems in production are exposed to a dynamic and often unpredictable environment in which query distributions shift, content is updated, and upstream services fluctuate.
In such settings, relying solely on periodic benchmarks is insufficient. Continuous monitoring provides real-time observability, ensuring that deviations, failures, or regressions are detected and diagnosed early, often before they become user-visible.
Effective continuous monitoring in RAG systems must capture a range of metrics across different pipeline components, spanning retrieval quality, generation groundedness, latency and throughput, and infrastructure utilization.
These metrics provide operational visibility across both infrastructure health and LLM quality dimensions.
A wide range of tools and methodologies are used to implement continuous monitoring in RAG systems, from observability platforms such as Langfuse and Arize Phoenix to drift monitors like WhyLabs and experiment trackers like MLflow.
Integration of these systems into the production pipeline enables real-time feedback loops, alerting mechanisms, and rollback strategies.
A mature monitoring system includes threshold-based alerting, which notifies stakeholders when key metrics fall outside predefined acceptable ranges (as set during benchmarking). For instance, an alert might fire when the hallucination rate exceeds its tolerance or when retrieval latency crosses a defined limit, as sketched below.
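A minimal sketch of such a threshold check; the metric names and acceptable ranges are illustrative:

def check_thresholds(metrics: dict, limits: dict) -> list:
    """Return alert messages for any metric outside its acceptable range."""
    alerts = []
    for name, (low, high) in limits.items():
        value = metrics.get(name)
        if value is not None and not (low <= value <= high):
            alerts.append(f"ALERT: {name}={value} outside [{low}, {high}]")
    return alerts

for msg in check_thresholds(
        {"p95_latency_ms": 2400, "groundedness": 0.71},
        {"p95_latency_ms": (0, 2000), "groundedness": (0.80, 1.0)}):
    print(msg)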
Real-time dashboards provide continuous visibility into such metrics and enable root cause analysis through time-series visualizations and trace comparisons.
Continuous monitoring is not only reactive; it is the foundation for building self-healing and adaptive systems. When paired with active learning loops, user feedback, or reinforcement signals, monitored outputs can inform index refreshes, retriever tuning, prompt adjustments, and retraining schedules.
Continuous monitoring is a non-negotiable aspect of post-development RAGOps. It ensures that deployed systems maintain operational quality, respond rapidly to changes, and adapt over time. By capturing real-time signals across retrieval, generation, and infrastructure layers, and translating these signals into actionable insights, monitoring transforms RAG systems from static deployments into living, learning applications that remain robust, trustworthy, and aligned with their intended purpose in dynamic production environments.
To operationalize RAG systems at scale, observability is the backbone of reliability. A rich ecosystem of platforms now provides tracing, evaluation, drift monitoring, and grounding diagnostics, each filling a distinct role in the RAGOps stack. Below are some of the core tools shaping this space.
Foundational platforms like Langfuse, Arize Phoenix, WhyLabs, and MLflow provide tracing, evaluation, drift monitoring, and prompt/version management for RAGOps. These are the backbone systems that give full-stack visibility.
Tools such as Ragas and the LlamaIndex observability module focus on synthetic test-data generation, grounding evaluation, and seamless instrumentation for RAG pipelines. They complement the core platforms with RAG-focused evaluation metrics.
Supporting pieces like OTEL, RAGViz, and InspectorRAGet enhance distributed tracing, visualization, and hybrid human + algorithmic evaluation. These extend the observability stack for more specialized diagnostics.
These tools provide end-to-end RAG observability, from prompt-level tracing and generation evaluation to grounding diagnostics and drift detection. You can combine platforms (e.g., Langfuse with Ragas for evaluation, WhyLabs for drift, and OTEL for tracing) to construct a robust, production-grade RAGOps stack tailored to your system's architecture and domain requirements. With this understanding, let us discuss a graph-based recommendation engine. The end-to-end code of the following architecture can be found in the GitHub repository.
With this foundation in observability, we can now shift focus to a graph-based recommendation engine, exploring how these monitoring principles extend into intelligent retrieval and recommendation pipelines.
The following architecture represents a modular and extensible RAG pipeline for a recommendation system that integrates structured product data, user preferences, graph-based relationships, and neural ranking techniques. The system leverages LangChain for orchestration, Faiss for vector indexing, NetworkX for graph representation, and transformer-based embedding models for semantic matching.
Figure 18.1: Graph-enhanced RAG-based recommendation architecture
Designing an effective recommendation engine requires more than simple retrieval; it demands a multi-stage pipeline that unifies semantic search, graph-based reasoning, and personalization. The following architecture outlines the complete flow, from transforming raw data into embeddings and structured graphs to orchestrating hybrid retrieval, reranking, and natural language generation for user-facing recommendations.
This phase involves query handling, hybrid search, result ranking, and natural language response generation.
The agent also accesses indexed embeddings and graph structures for real-time decision-making.
Key technologies include LangChain for agent orchestration, Faiss for vector search, NetworkX for the graph representation, and transformer-based embedding models for semantic matching.
This system exemplifies a robust RAG architecture enhanced with graph-based reasoning and user preference modeling. By integrating multiple retrieval modalities (semantic, structural, and personalized) and leveraging a reranker and language model for final output formulation, it ensures high relevance and interpretability in recommendations. The design also supports modularity, enabling adaptability across various domains where product recommendation or content retrieval is required.
A distinguishing characteristic of this architecture is its agentic design, which enables intelligent decision-making and dynamic tool selection within the retrieval and recommendation pipeline. Rather than relying on a static sequence of operations, the system delegates control to a LangChain-powered agent that is capable of executing a reasoning-driven retrieval workflow. This agent-centric approach introduces flexibility, modularity, and explainability into the pipeline, facilitating adaptive interactions based on query complexity and user context.
At runtime, the agent receives the user query and autonomously determines which retrieval tools to invoke, in what order, and how to combine the results. This planning-execution paradigm allows the agent to decompose complex queries, select the most appropriate retrieval tools, and adaptively merge their results.
The agent also interprets intermediate observations (e.g., partial retrieval results) and can conditionally rerun tools, enhancing its capacity for complex decision-making.
The retrieval system under the agent's control comprises three specialized tools, each addressing a different dimension of relevance: a vector search tool for semantic similarity, a graph search tool for structural relationships, and a hybrid search tool that fuses both signals.
The agent integrates these tools into a reasoning loop, leveraging LangChain's ReAct-style framework. It does not simply execute tools in a predefined order; rather, it plans which tool to invoke, observes the intermediate results, and iterates until it can produce a grounded recommendation.
This agentic orchestration ensures that the system can adapt retrieval strategies based on query type (e.g., factual, relational, personalized), domain structure, and user preferences.
Table 18.3, titled Post-development failure points and metrics for agentic RAG system, provides a detailed breakdown of potential failure points and the corresponding key metrics that should be monitored across each major component of the system. This structured approach ensures robust observability, supporting continuous performance evaluation and operational reliability in production.
To ensure operational robustness and sustained performance in the post-development phase, it is essential to identify and continuously monitor the key failure points across the various components of the agentic RAG system. Each module, ranging from embedding generation and retrieval tools to agent orchestration and output generation, presents unique risks that can impact system effectiveness, UX, and overall reliability. The following table outlines these critical failure points and specifies the corresponding metrics that should be tracked to enable timely diagnostics, informed optimization, and adherence to quality standards in a production environment:
| Component | Failure points | Key metrics |
| --- | --- | --- |
| Embedding generation | Stale embeddings, low-quality vectors, inconsistent formats | Embedding drift score, coverage ratio, and update frequency |
| Vector search tool | Low semantic recall, retrieval latency, and irrelevant top-k results | Recall@k, precision@k, query latency, and semantic overlap score |
| Graph search tool | Disconnected nodes, sparse traversal paths, and graph mismatch | Node connectivity, average path length, and graph hit rate |
| Hybrid search tool | Inconsistent fusion logic, overfitting to one source, and low diversity | Score agreement rate, result diversity index, and retrieval consistency |
| Agent orchestration | Invalid tool selection, failed execution plans, unhandled exceptions | Tool success rate, plan execution time, and toolchain accuracy |
| Reranking (cross-encoder) | Misranking, latency bottleneck, and unfaithful reordering | Rerank score correlation, latency, top-1 relevance accuracy |
| LLM wrapping for recommendation | Ungrounded generation, hallucination, incoherence | Faithfulness score, hallucination rate, BERTScore |
| End-to-end response quality | Poor personalization, low engagement, and factual inconsistency | User rating score, groundedness rate, and session engagement rate |

Table 18.3: Post-development failure points and metrics for graph-based agentic RAG systems
In modern software systems, DevOps, MLOps, and RAGOps serve distinct yet complementary roles that, when integrated, enable scalable, intelligent, and resilient applications. DevOps focuses on the automation of software development and deployment workflows, ensuring consistent integration, delivery, monitoring, and infrastructure management. It lays the foundation for CI/CD pipelines, testing frameworks, and system reliability.
MLOps extends DevOps principles to machine learning (ML) workflows. It enables the operationalization of models through reproducible training pipelines, versioning of datasets and models, automated deployment, and monitoring of model performance over time. MLOps ensures that ML models remain reliable, adaptive, and governed after they are deployed.
RAGOps builds on these foundations to support RAG systems, particularly those combining vector search, retrieval logic, and LLMs. RAGOps introduces new observability and evaluation challenges unique to LLM-based applications, such as monitoring grounding quality, hallucination rates, and retrieval faithfulness. It also addresses traceability across retrieval, reranking, and generation components.
Together, DevOps ensures system stability, MLOps ensures model integrity, and RAGOps ensures prompt-level reasoning traceability and retrieval quality. When orchestrated cohesively, they enable the continuous development, deployment, and refinement of GenAI applications, bridging classical engineering reliability with generative intelligence. With these roles in view, we now turn to integrating MLflow with a recommendation system.
The following figure illustrates our end-to-end movie-recommendation architecture: a LangGraph-driven retrieval pipeline that turns user queries into Cypher, executes them on Neo4j, and summarises the results with Mistral, while an attached observability layer logs faithfulness and relevance metrics to MLflow for continuous quality monitoring:
In order to instrument retrieval-augmented language model systems with rigorous experiment logging, the first prerequisite is the installation of MLflow, the de facto open-source platform for model tracking.
Install MLflow with pip:
pip install mlflow
This installs the client library that exposes the core programmatic interface, e.g., mlflow.log_param, mlflow.log_metric, and mlflow.start_run, as well as the GenAI-specific extension mlflow.evaluate, which implements the quality assessment. These APIs enable researchers to capture every experimental artefact (hyperparameters, retrieval hits, generated answers, evaluation scores) in a reproducible, queryable form.
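A minimal sketch of this logging interface in action; the experiment name, run name, and values are placeholders:

import mlflow

mlflow.set_experiment("rag_recommendation")  # hypothetical experiment name
with mlflow.start_run(run_name="smoke_test"):
    mlflow.log_param("top_k", 5)                 # a hyperparameter of the pipeline
    mlflow.log_metric("recall_at_5", 0.83)       # an evaluation score
    mlflow.log_text("example generated answer", "final_answer.txt")  # an artefact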
To visualise and compare such runs locally, one can launch a lightweight tracking server that persists results to the file system. If the Python Web Server Gateway Interface (WSGI) container Waitress is not already present, it may be added via:
pip install waitress
Subsequently, a single command suffices to expose the MLflow REST and web interface:
waitress-serve --host 127.0.0.1 --port 5000 mlflow.server:app
This spins up the dashboard at http://127.0.0.1:5000, as shown in Figure 18.3, from which investigators can inspect parameter histories, metric trajectories, and artefacts for every run, thereby closing the observability loop for RAG pipelines.
The solution depicted in Figure 18.2 serves solely to illustrate how a graph-based recommendation pipeline, or more broadly, a GenAI workflow, can be instrumented and integrated with MLflow for experiment tracking and observability. Detailed explanations of the underlying components, such as Neo4j, Text2Cypher, and the Relik named entity recognition (NER) model, and their implementation are beyond the scope of this book; they are covered extensively in Learn Python Generative AI, Version 2 by BPB author Indrajit Kar. This chapter focuses exclusively on the observability pipeline with MLflow.
In the context of graph-based recommendation systems using LangChain agents and Ollama models, observability and evaluation play a crucial role in ensuring trust, explainability, and system debugging. This study examines two distinct approaches for integrating MLflow into such a pipeline.
Both approaches are used to instrument an LLM-backed Cypher generator and answer synthesizer, but they differ fundamentally in tracing strategy, complexity, and scope. The first, patch-based approach is described next.
This approach explicitly instruments the ollama.chat() function using MLflow's low-level tracing API, with the MLflow_ollama_patch.py file applying a decorator-based patch to Ollama.
The MLflow_ollama_patch.py module serves as a minimal tracing interface that instruments all calls to ollama.chat() using MLflow’s low-level span API. This is achieved through the trace_ollama_chat decorator, which wraps the original function:
import mlflow
from mlflow.entities import SpanType
from functools import wraps

def trace_ollama_chat(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Open a chat-model span around every ollama.chat() call
        with mlflow.start_span(name="ollama.chat", span_type=SpanType.CHAT_MODEL) as span:
            span.set_inputs({"messages": kwargs.get("messages"), "model": kwargs.get("model")})
            response = func(*args, **kwargs)
            span.set_outputs(response)
            return response
    return wrapper
The patch is applied dynamically:
ollama.chat = trace_ollama_chat(ollama.chat)
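Because the patch rebinds ollama.chat at import time, importing the module before any agent code runs is sufficient to activate tracing. A minimal usage sketch (the experiment name and prompt are assumptions; the module name follows the book):

import mlflow
import MLflow_ollama_patch  # importing this module applies the ollama.chat patch
import ollama

mlflow.set_experiment("ollama-tracing-demo")  # assumed experiment name

# Every subsequent call is now recorded as a CHAT_MODEL span in the MLflow UI.
reply = ollama.chat(model="mistral", messages=[{"role": "user", "content": "Hello"}])
print(reply["message"]["content"])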
This method ensures that each invocation of the LLM is recorded as a span, complete with input prompts, model names, and responses. These spans are visualized in the MLflow UI as part of the execution trace, enabling developers to audit tool use, analyze prompt-response behavior, and diagnose errors at the function level.
In the corresponding main_with_m_patch.py, the user does not need to add extra logic beyond logging parameters and outputs; all LLM calls are traced automatically.
The file main_with_m_patch.py integrates a LangChain agent-based QA system with MLflow tracking, while enabling low-level tracing of ollama.chat() calls using the custom module MLflow_ollama_patch. This script offers both semantic quality evaluation of the final answer and execution-level tracing of LLM interactions. The following is a structured breakdown and explanation:
user_question = input("Ask your question: ")
schema = """(:Movie {title, genre, mood, release_year}) ..."""
The script captures a free-text query and defines a knowledge graph schema outlining the structure of movie, actor, director, and platform nodes and their relationships.
result = app.invoke(inputs)
cypher_query = result.get("cypher_query", "")
The LangGraph agent processes the user question and schema to generate a Cypher query (cypher_query), execute it on a Neo4j graph database (query_results), and generate a natural language answer (final_answer) using the Mistral model via Ollama.
with mlflow.start_run(run_name="CypherTest_Run1") as run:
A named MLflow run is started. Inside the try block, the system logs:
mlflow.log_param("question", user_question)
mlflow.log_param("cypher_query", cypher_query or "EMPTY")
mlflow.log_text(json.dumps(query_results, indent=2), "neo4j_context.json")
mlflow.log_text(final_answer or "EMPTY", "final_answer.txt")
faith_score = evaluate_faithfulness_with_ollama(...)
rel_score = evaluate_relevance_with_ollama(...)
mlflow.log_metric("faithfulness", faith_score)
mlflow.log_metric("relevance", rel_score)
These scores measure whether the final answer reflects database facts (faithfulness) and whether it aligns with the user's query intent (relevance). To ensure robustness, fallback scores of 0.0 are recorded in case of exceptions, allowing MLflow logs to remain complete and traceable.
By capturing these metrics within MLflow, the evaluation process becomes a core part of the RAGOps observability pipeline, enabling continuous monitoring, failure diagnosis, and data-driven improvement of both retrieval and generation components.
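The helper bodies are elided in the listing above; a minimal sketch of what an Ollama-based judge such as evaluate_faithfulness_with_ollama could look like is shown below (the 1-5 prompt wording and score parsing are assumptions, not the author's exact implementation):

import re
import ollama

def evaluate_faithfulness_with_ollama(question, context, answer, model="mistral"):
    # Ask a judge LLM to rate, on a 1-5 scale, how well the answer is grounded in the context.
    prompt = (
        "Rate from 1 to 5 how faithfully the answer is supported by the context.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with a single number."
    )
    try:
        reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        match = re.search(r"[1-5]", reply["message"]["content"])
        return float(match.group()) if match else 0.0
    except Exception:
        return 0.0  # fallback score, mirroring the robustness strategy described above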
import MLflow_ollama_patch
Internally, MLflow_ollama_patch.py wraps ollama.chat() with an MLflow span logger:
def trace_ollama_chat(func):
    ...
ollama.chat = trace_ollama_chat(ollama.chat)
This records every ollama.chat() call as a traceable chat-model span in MLflow's UI, capturing the input messages, the model name, and the full response. This enables fine-grained observability of prompt evolution and model behavior, which is invisible in standard mlflow.log_*() calls.
The following bar charts represent semantic evaluation scores for faithfulness and relevance, both scoring 3.0, indicating moderate alignment between the chatbot's generated response and the Neo4j query results. These metrics are automatically computed and logged via the MLflow observability pipeline.
Figure 18.4: Screenshot of MLflow UI displaying model-level metrics for the run named CypherTest_Run1
In contrast, the main.py script leverages MLflow's first-party GenAI metrics API to assess the quality of final model outputs, rather than tracing intermediate tool usage. The focus is on evaluating the faithfulness and relevance of the final answer generated by Ollama.
For instance:
from mlflow.metrics.genai import faithfulness, relevance
faith_score = faithfulness(model="ollama:/mistral")(
    predictions=[final_answer],
    inputs=[user_question],
    context=[json.dumps(query_results)]
).scores[0]
This call uses a second LLM to evaluate whether the generated answer is consistent with the retrieved knowledge and prompt. The same approach is applied to the relevance() metric. These scalar scores are logged using:
mlflow.log_metric("faithfulness", faith_score)
mlflow.log_metric("relevance", rel_score)
This method does not require patching or custom spans, and it aligns with end-to-end output validation, making it particularly suitable for RAG pipelines or summarization systems.
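The same first-party metrics can also be computed in batch over an evaluation table via mlflow.evaluate; the following sketch assumes a one-row pandas DataFrame with the column names shown (all sample values are placeholders, not from the book's pipeline):

import mlflow
import pandas as pd
from mlflow.metrics.genai import faithfulness, relevance

eval_df = pd.DataFrame({
    "inputs": ["Which movies are available on Netflix?"],   # sample user question (assumed)
    "predictions": ["Netflix currently offers ..."],        # pipeline answer (placeholder)
    "context": ['[{"title": "Inception"}]'],                # retrieved rows (placeholder)
})

with mlflow.start_run(run_name="batch_eval"):
    results = mlflow.evaluate(
        data=eval_df,
        predictions="predictions",
        extra_metrics=[faithfulness(model="ollama:/mistral"),
                       relevance(model="ollama:/mistral")],
        evaluator_config={"col_mapping": {"context": "context"}},
    )
    print(results.metrics)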
The patched tracing approach is optimal when the focus is on tool auditing, intermediate reasoning-chain analysis, or span-based ML observability. Meanwhile, the direct mlflow.metrics.genai method is suitable for evaluating the quality of LLM output, especially in RAG systems where answer trustworthiness matters. For a complete pipeline, these approaches are complementary, using both spans and scores for full-stack GenAI observability.
To illustrate the utility of MLflow in providing semantic evaluation and experiment observability within a graph-based recommendation pipeline, we log and visualize output quality metrics such as faithfulness and relevance. These metrics are computed using Mistral via Ollama as the evaluator and automatically recorded for each agentic run.
The run names shown on the left side of the MLflow dashboard are generated automatically by MLflow. These are human-friendly identifiers, meant to help you quickly distinguish runs in the UI; MLflow assigns them by default if you do not explicitly set a name for a run.
If you would prefer meaningful names (like CypherTest_Run1), you can explicitly name a run in your code:
with mlflow.start_run(run_name="CypherTest_Run1") as run:
Figure 18.5: MLflow dashboard displaying comparative evaluation of multiple runs using model-level metrics
Each bar in Figure 18.5 corresponds to a unique model run (e.g., awesome-roo-333, masked-hound-282), automatically logged and color-coded for visual differentiation. The faithfulness and relevance charts capture semantic alignment between chatbot responses and Neo4j query outputs via Mistral. This comparative visualization helps developers and researchers assess relative performance across experiments, enabling systematic refinement of graph-based recommendation pipelines.
In scenarios where MLflow is configured with a local file-based backend (as opposed to a remote server or SQL store), understanding the directory layout of the mlruns folder is crucial for diagnosing tracking and logging issues. This appendix outlines the structure and interpretation of MLflow's run-level logs and metadata, and provides practical guidelines for troubleshooting.
When executing the command:
ls -l mlruns/0
the output displays subdirectories corresponding to individual MLflow runs within the default experiment (experiment ID 0). Each directory name is a universally unique identifier (UUID) representing a specific run. An example listing may appear as:
drwxr-xr-x 7 <Your user name> <Your user name> 224 17 Jul 15:16 068e9e07e75f40efa5c360225157a3ed
drwxr-xr-x 7 <Your user name> <Your user name> 224 17 Jul 16:28 7776a3d0407f45b8b8cd2a92caa4645b
-rw-r--r-- 1 <Your user name> <Your user name> 212 17 Jul 15:16 meta.yaml
Each subdirectory contains the logs, parameters, metrics, and metadata associated with an individual run. The top-level meta.yaml file stores experiment-level information, such as the experiment name and lifecycle status.
Inspecting an individual run directory using:
ls -l mlruns/0/7776a3d0407f45b8b8cd2a92caa4645b/
produces a listing such as:
drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 artifacts
-rw-r--r-- 1 <Your user name> <Your user name> 395 17 Jul 16:28 meta.yaml
drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 metrics
drwxr-xr-x 4 <Your user name> <Your user name> 128 17 Jul 16:28 params
drwxr-xr-x 6 <Your user name> <Your user name> 192 17 Jul 16:28 tags
Each component serves a distinct purpose. The following table presents an overview of these core directories and files, summarizing their essential functions in capturing artifacts, logging metrics, storing parameters, and recording critical metadata. This foundational structure ensures a consistent and organized approach for every ML run, making it easier to audit results, compare experiments, and maintain reproducibility throughout the workflow.
| Component | Description |
| artifacts/ | Stores external files logged using mlflow.log_artifact() or log_model(). |
| metrics/ | Contains time-stamped metric logs stored as individual files (one file per metric). |
| params/ | Contains parameters logged via mlflow.log_param() as key-value pairs. |
| tags/ | Contains metadata tags such as run name, source, and user. |
| meta.yaml | Stores run metadata, including run status, start/end times, and user info. |
Table 18.4: Brief description of each key component found within a run directory
The run name visible in the UI (e.g., traveling-ant-386) is stored as a tag (mlflow.runName) and can be found within the tags/ directory.
The local mlruns/ directory structure provides a transparent and accessible way to inspect and debug experiment tracking when using MLflow without a remote backend. Each experiment is mapped to a directory by its ID, and every run is stored in a uniquely named folder with standardized subdirectories for metrics, parameters, artifacts, and metadata. Understanding this structure is essential for developers building robust ML observability systems, especially in the context of experimental LLM-based pipelines, where reproducibility and traceability are foundational.
Troubleshooting MLflow through the local filesystem structure offers direct visibility into how runs, artifacts, and metadata are organized. By inspecting run folders, logs, and parameter files, practitioners can quickly isolate issues, verify experiment integrity, and ensure reproducibility, making filesystem-level exploration a practical first step in MLflow debugging.
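As a complement to filesystem inspection, the same information can be read programmatically through the tracking client; a minimal sketch against the local file store (experiment ID "0", as in the listings above):

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("file:./mlruns")   # point at the local file-based backend
client = MlflowClient()

# Enumerate runs in the default experiment and print the fields discussed above.
for run in client.search_runs(experiment_ids=["0"]):
    print(run.info.run_id,
          run.data.tags.get("mlflow.runName"),   # the human-friendly run name tag
          run.data.metrics)                      # e.g., faithfulness / relevance scores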
As we come to the end of this journey into multimodal GenAI, it is important to recognize that while multimodal systems represent the frontier, enabling models to reason jointly across text, images, audio, and beyond, they are built upon a strong foundation of traditional generative models. For a deeper dive into these fundamentals, refer to Learn Python Generative AI: Journey from Autoencoders to Transformers to Large Language Models, which explores the core architectures that paved the way for today's multimodal breakthroughs.
As we draw the final lines of this chapter, and indeed this book, we reflect on a journey that has traversed the complex terrain of operationalizing RAG systems and the broader landscape of GenAI. From our exploration of foundational run directory structures and troubleshooting in MLflow, to the nuanced requirements of observability, evaluation, and traceability in production-grade GenAI applications, each section has built toward a holistic understanding of robust AI system deployment. The comparison of DevOps, MLOps, and RAGOps illuminated the evolving paradigms for managing intelligent systems that intertwine software engineering and generative reasoning.
The hands-on examples, including MLflow instrumentation and graph-enhanced recommender systems, rooted theory in practice, emphasizing that the pillars of reproducibility, transparency, and accountability are vital for the future of AI-driven innovation. As we close, it is clear that these methodologies are not mere technical necessities but the foundation for ethical and sustainable AI development.
A
Abductive Reasoning 269
Agentic AI 92
Agentic AI/AI Agents, comparing 127, 128
Agentic AI, architecture 92
Agentic AI, terms
Agents SDK 93
Assistants API 94
Codex 94
Operator 93
Response API 92
Agentic GenAI 106
Agentic GenAI, pattern
Aggregator 109
Critic/Validator 115
Database 113
Hierarchical 111
Human-in-the-Loop 111
Loop 108
Memory Transformation 113
Multimodal Agent 116
Negotiator 116
Network 110
Parallel 106
Planner-Executor 114
Router 109
Sequential 107
Shared Tools 112
Supervisor-Subordinate 118
Temporal Planner 120
Voting/Consensus 117
Watchdog/Recovery 119
Agentic RAG/Non-Agentic RAG 30, 31
Analogical Reasoning 270
Autoregressive Generation 9
Autoregressive Generation, strategies
Temperature 10
Top-k Sampling 10
Top-p Sampling 10
B
Bi-Encoders/Cross-Encoders 24
Bi-Encoders/Cross-Encoders, pattern
Bi-Encoders 24
Cross-Encoders 24
C
Cloud LLMs 189
Cloud LLMs, concepts
LLM-as-a-Judge 190
Rationale/Functionality 190
Code Implementation 205
Code Implementation, components
ChromaDB 206
Configuration Management 205
Data Loaders 206
Embedding Functions 205
Code Implementation, ensuring 206-208
ColBERT/ColPali 139
ColBERT/ColPali, capabilities 139
Commonsense Reasoning 270, 271
Continuous Monitoring 404
Continuous Monitoring, points
Anomaly Detection 405
Self-Healing Systems 406
Continuous Monitoring, sources
RAGOps 404
RAG Systems 404
Continuous Monitoring, techniques
Custom Evaluators 405
Drift Detection 405
Logging/Tracing 405
Observability Platforms 405
Cross-Encoder 198
Cross-Encoder, architecture 198
Cross-Encoder, embedding 201, 202
Cross-Encoder/Late Interaction, comparing 198, 199
Cross-Modal Interaction 160
Cross-Modal Interaction, functionalities 160-162
Cross-Modal Interaction, illustrating 169-171
Cross-Modal Interaction, terms
Data Directory 166
Frontend 163
Loaders 167
Retrieval System 166
D
Data Accessibility 321
Data Accessibility, terms
Curiosity 323
Data Governance/Traceability 323
Data Literacy 322
Democratizing Data 322
Global Access 323
Real-Time Decision-Making 322
Technical Bridging 321
Data Ingestion Pipeline 409
Data Ingestion Pipeline, recommendations 409
Deductive Reasoning 268
E
Entity Extraction 317
Entity Extraction, implementing 318-321
Entity Extraction, workflow 318
F
Few-Shot Prompting 283
Few-Shot Prompting, benefits 283
Few-Shot Prompting, limitations 283
G
GenAI, advancements 3
GenAI Agent 29
GenAI, capabilities
Agentic AI 267
Ambiguity/Disambiguation 265
Deliberation 264
Human-AI Collaboration 267
Learning Generalizable 266
Multimodal Integration 265
Prompt Engineering/CoT Reasoning 266
Reranking/Meta-Reasoning 266
Trust/Explainability 265
GenAI, models
Artificial Neural Networks (ANNs) 375
Classification 374
CNNs 375
Forecasting 374
OCR 376
Regression 374
Segmentation 375
Generation Framework 176
Generation Framework, architecture 178
Generation Framework, checklist 179
Generation Framework, ensuring 176, 177
Generation Framework, outlines
Document/Image Ingestion 178
Embedding Models 178
LLM 179
Output Delivery 179
User Query Interface 178
Vector Database 178
Vector Search 179
Generation System 7
Generation System, architecture 8
Generation System, techniques
Diffusion Models 9
Language Models 9
Vision Models 9
Generation System, types
Audio Generation 9
Image Generation 9
Text Generation 9
Generative AI (GenAI) 2
Generator Part 180
Generator Part, sources 180, 181
Genetic Algorithms (GA) 228
Grading Mechanisms 143
Grading Mechanisms, advantages 144
Grading Mechanisms, layers
Adaptive RAG 143
Agentic RAG 143
CRAG 143
Self-RAG 143
Grading Mechanisms, outlines
Answer Quality Grader 148
Hallucination Detection Grader 146
Retrieval Relevance Grader 145
Graphics Processing Units (GPUs) 64
Guardrails, frameworks
Azure AI Prompt Shields 28
NVIDIA NeMo 28
OpenAI Moderation API 28
Guardrails, methods 27
Guardrails, types
Input 26
Output 26
H
HITL, types
End-to-End 124
Multi-Agent 124
Human-In-The-Loop (HITL) 122
I
Interaction 130
Interaction, types
Full 131
Late 132
No 131
L
LLM Evaluation 393
LLM Evaluation, methods 394
LLM Evaluation, stages 393
LLMs 182
LLMs, sections
HistLLM 183
LLMRec 183
MMREC 183
Molar 183
Serendipitous MLLM 183
LLMs, use cases
Baseline Model Development 377
Data Characteristics 377
Stacked Ensemble Learning 378
Local GPU 65
Local GPU, capabilities
Deployment Patterns 67
Hardware Requirements 66
Model Files 67
Performance Tips 67
Software 67
Local GPU, configuring 66
M
Mathematical Reasoning 274, 275
Mistral 364
Mistral, integrating 364
MLflow 414
MLflow, ensuring 414
MLLM, configuring 182
ML Model Integration 387
Model Context Protocols (MCP) 31, 32
Multi-Document Query 94
Multi-Document Query, initializing 94
Multi-Index Embedding 202
Multi-Index Embedding, configuring 202-204
Multimodal GenAI System 40
Multimodal GenAI System, categories
Image Systems 57
Image-to-Text 55
Text and Image 56
Text-to-Code 59
Text-to-Image 54
Text-to-SQL 58
Multimodal GenAI System, illustrating 50, 51
Multimodal GenAI System, steps
Embedding Generation 41
Knowledge Base 42
Response Generation 42
Result Returning 43
Retrieved Results Consolidation 42
User Query Submission 41
Vector Database Search 42
Multimodal LLM (MLLM) 181
Multimodal RAG System 229
Multimodal RAG System, architecture 236
Multimodal RAG System, flow
Adaptive Embedding 239
Context Assembly/Language Generation 238
Indexing Behavior 239
Two-Stage Retrieval 238
Vector Embedding Pipeline 237
Multimodal RAG System, illustrating 230
Multimodal RAG System, initializing 231-235
Multimodal Reasoning 277
Multimodal Retrieval 199
Multimodal Retrieval, strategies 199
Multimodal Retrieval System 156
Multimodal Retrieval System, applications
Content Discovery 160
Medical Imaging 159
Multimodal QA 159
Visual Product Search 159
Multimodal Retrieval System, architecture 156, 157
Multimodal Retrieval System, components
Document Chunking 158
Image Modalities 157
Query Encoding 158
Result Mapping/Response Generation 159
Technical Enhancement 159
User Interaction/Query Intake 157
Vector Store Integration 158
Multimodal Systems 154
Multimodal Systems, implementing 155
Multimodal Systems, sections
Image-to-Image 155
Image-to-Text 155
Text-to-Image 154
Text to Specs 155
Multimodal Vector Embedding 43
Multimodal Vector Embedding, architecture 43, 44
Multimodal Vector Embedding, queries
Multiple Collections 49
Single Collection 48
Multimodal Vector Embedding, solutions
Collections 45
Multimodal Vector Database 44
Payload 45
Point IDs 45
Storage/Vector Store 46
Vectors 45
Multi-Stage RAG 140
Multi-Stage RAG, benefits 141
Multi-Stage RAG, components
Hybrid Retrieval 140
Iterative Feedback 141
Multimodal Retrieval 140
Query Expansion/Refinement 140
Reranking Stage 140
Validation/Fact-Checking 141
Multi-Stage RAG, implementing 151
Multi-Stage RAG, outlines
Adaptive 142
Agentic 142
Branched 142
Corrective 142
Hypothetical Document Embedding (HyDE) 142
Self 142
Simple 141
Simple Memory 141
Multi-Stage RAG, stage
Generation 140
Retrieval 140
Multi-Vector Representation 133
Multi-Vector Representation, configuring 133, 134
Multi-Vector Representation, ensuring 134, 135
O
Observability 406
Observability, platforms
Arize Phoenix 407
Langfuse 406
MLflow 407
WhyLabs 407
Observability, tools
InspectorRAGet 408
OTEL Instrumentations 407
RAGViz 407
OCR 350
OCR, architecture 354
OCR, concepts
Mistral 364
Receipt Data 366
Regex Context 365
OCR, illustrating 352
OCR, terms
Generate Intelligent 361
Image-Based Inputs 355
Shopping Assistance 355
Ollama, capabilities
AutoGPTQ 69
GPT4All 69
LM Studio 69
Text Generation Web UI 69
Unsloth 69
Ollama With PDF Document, preventing 71-73
OpenAI 88
OpenAI API 88
OpenAI API, categories 89
OpenAI API, functionalities 89
OpenAI API, use cases
Accessing Models 90
Major OpenAI Models 89
Right Model 91
OpenAI, breakdown
Generative Responsive 186
Grading/Generation Models 189
Import Statements 186
Retrieval Relevance 187
OpenAI, components
Generative Response 186
Retrieval Relevance 186
OpenAI, history 88
OpenAI, sections
Chain Assembly 102
Configuration 97
Conversational Memory 102
Dependencies 103
Document Load/Chunking 99
Hybrid Retriever 100
Initialization Embedding 97
Language Model 101
Main Controller 96
Metadata Tagging 99
Prompt Template 101
Vector Store 98
Orchestration 15
Orchestration, terms
RAG Systems 15
P
Prompting, architecture 282
Prompting, scenarios 286
Prompting, techniques
Few-Shot 283
Zero-Shot 282
R
RAG, applications
Document QA Systems 15
Enterprise Chatbots 15
Knowledge Management 15
Personalized AI Assistants 15
RAG-Based Recommendation System 408
Groundedness 11
Hallucination 11
State Knowledge 11
RAG, components 73
RAG, concepts
Conversation Buffer Memory 81
Hybrid/Semantic Search 79
LangChain 76
Metadata, embeddings 77
Natural Language Generation 81
PDF Document 75
QA Chain 82
ReAct Prompt 81
User Chat Loop 83
RAG Evaluation 394
RAG Evaluation, cause
Distinction 395
GenAI Ops 395
Output Quality Ensuring 395
RAG Evaluation, layers
Generation/Groundedness 394
Pipeline-Level Metrics 395
Retrieval Quality 394
RAG/LLM Evaluation, terms
Feedback Loops 396
Hallucinations Drift, monitoring 395
Retrieval Quality, evaluating 396
Version Control/Traceability 396
RAGOps 397
RAGOps, scenarios
During Development 397
Post-Development 400
RAG Pipeline, libraries
LlamaIndex 407
Ragas 407
RAG Pipeline, outlines
Context Preparation 12
Generation 12
Output Delivery 12
Query Understanding 12
Retrieval 12
RAG, steps
Generation 12
Retrieval 11
RAG, techniques
Memory-Augmented 14
Multimodal 14
Reranking 14
RAG, terms
Iterative 13
Prompt Engineering 14
Vector Databases 13
RAG, types
Single-Stage 12
Two-Stage 13
Real-Time Retail Intelligence 330
Real-Time Retail Intelligence, outlines
Delayed Decisions 330
Query Latency 331
Revenue Impact 331
Siloed Data 330
Reasoning 268
Reasoning, benchmark 278
Reasoning, types
Abductive 269
Analogical 270
Causal 271
Commonsense 270
Deductive 268
Inductive 268
Mathematical 274
Multimodal 277
Spatial 272
Temporal 273
Tool-Based 275
Receipt Data 366
Receipt Data, demonstrating 367-369
Recommendation Stage 293
Recommendation Stage, outlines 294
Recommendation Stage, steps 296
Recommendation Stage, workflow 297-299
Regex Context 365
Regex Context, illustrating 366
Reranking 23, 24, 194, 195
Reranking, architecture 195, 196
Reranking, categories
Cross-Encoder 196
Hybrid 197
Late Interaction 196
Learning-to-Rank 197
LLM-Based 197
Reranking, illustrating 286, 287
Reranking, module
embedding_utils.py 288
index_builder.py 289
langgraph_agent.py 290
loaders.py 287
reranker.py 289
Retrieval-Augmented Generation (RAG) 11
Retrieval Pipeline, phase 409
Retrieval Pipeline, terms
Agentic Control Loop 410
Agentic RAG Design 410
Retrieval System, architecture 214
Retrieval System, challenges
Adaptive Index 227
Contextual Filtering 226
Embedding Normalization 225
Genetic Algorithms 227
Modality-Based Routing 224
Query Expansion 224
Score Fusion 226
Weighted Embedding Fusion 225
Retrieval System, evolutions
Hybrid Retrieval 7
Learning-to-Retrieve (LTR) 7
Memory-Augmented 7
Multi-Vector Representation 7
Retriever-Generator Fusion (RAG) 7
Retrieval System, limitations
Contextual Awareness 216
Index Staleness 216
Limited Semantic 215
Modality Mismatch 215
Precision Trade-Offs 215
Ranking Inefficiencies 216
Retrieval System, techniques
Adaptive Index 223
Embedding Normalization 219
Hybrid Retrieval 220
Modality-Based Routing 218
Multi-Index Embedding 218
Query Expansion 219
Reranking 221
Retrieval System, types
Dense 5
Sparse 5
S
Software Development/Ops, comparing 412, 413
Streamlit-Based Frontend 246
Streamlit-Based Frontend, implementing 247, 248
STT/TTS 243
T
Text-to-SQL 302
Text-to-SQL, challenges
Ambiguity 307
Data Privacy/Governance 310
Domain Generalization 308
Feedback Loops 310
Multi-Turn Interaction 309
Query Execution 309
Schema Alignment/Linking 308
SQL Syntax 309
User Intent Disambiguation 310
Text-to-SQL, configuring 302-305
Text-to-SQL, domains
BI/Analytics 305
Conversational Interfaces 306
Financial Services/Risk Monitoring 307
Healthcare/Clinical Informatics 306
Human Resources 307
IoT Operations 307
Operations Analytics 306
Retail/E-commerce Personalization 306
SQL Learning 306
Text-to-SQL, illustrating 311-313
Text-to-SQL Pipeline 331
Text-to-SQL Pipeline, architecture 334, 335
Text-to-SQL Pipeline, concepts
Agent Modules 336
Frontend Interface 338
Index Initialization 338
Infrastructure Layer 337
Main Execution Layer 336
Task-Oriented 337
Text-to-SQL Pipeline, embedding 341, 342
Text-to-SQL Pipeline, entity
Generate SQL Query 343
SQL Query Grade 344
Summary Grade 345
Text-to-SQL Pipeline, instructions 335
Text-to-SQL Pipeline, integrating 339, 340
Text-to-SQL Pipeline, steps 332, 333
Text-to-SQL, practices 327, 328
Text-to-SQL, sections
Component-Level 325
Exact Match Accuracy 324
Execution Accuracy 324
Human Evaluation 326
Query Execution Success 325
Semantic Equivalence 326
Throughput Metrics 326
Tokenization 17
Tokenization, ensuring 18
Tokenization, types
Byte-Level 18
Character-Level 18
Subword-Level 18
Word-Level 17
Token Utilization 144
Token Utilization, terms
Input Contexts 144
Intermediate Summarization 144
Long-Form Generation 144
Two-Stage RAG 138
Two-Stage RAG, architecture 138
Two-Stage RAG, reasons
Ensemble Robustness 139
Generation Alignment 139
Ranking Fidelity 139
Two-Stage RAG Systems 135
Two-Stage RAG, terms
One Dense Retrievals 138
Semantic Precision 138
V
Vector Database 19
Vector Database, architecture 20
Vector Database, ensuring 23
Vector Database, operations
Embedding Models 22
Indexing Algorithms 20
Search Algorithms 21
Vision-Language Models (VLMs) 34
VLMs, architecture 50
VLMs, cases 52
VLMs, challenges
Data Requirements 39
Efficiency/Latency 40
Generalization Across Domains 40
Lack Integration 40
Limited Multimodal Reasoning 39
Modality Imbalance 39
VLMs, types
Generative 35
Instruction-Tuned 36
Multimodal Reasoning 36
Retrieval-Focused 35
VQA/Captioning 35
Voice-Enabled Pipeline 248, 249
Voice-Enabled Pipeline, interacting 249-253
Voice-Enabled RAG 245
Voice-Enabled RAG, concerns
Streamlit-Based Frontend 246
Tech Stack 246
Voice-Enabled Pipeline 248
X
XGBoost Pipeline 379
XGBoost Pipeline, illustrating 380, 381
XGBoost Pipeline, terms
Agent Orchestration 386
FastAPI Inference 385
FastAPI Serving Layer 385
LangChain Code 383
Z
Zero-Shot Prompting 282
Zero-Shot Prompting, benefits 283
Zero-Shot Prompting, limitations 283